Rate Limiter System Design

A rate limiter is a system designed to prevent a user from making too many requests to a service within a certain timeframe. This is used to protect services from abuse and can be implemented for various scopes, such as per user, per IP address, etc.

Functional Requirements

  • The client can be identified by a cookie, user ID, or IP address.
  • If the throttling threshold is surpassed, the rate limiter will reject the request and send a "429 Too Many Requests" status code back to the client.
  • If the throttling threshold is not surpassed, the rate limiter will forward the request to the downstream server.
  • The rate limiter should "smooth out" traffic spikes. If there is a surge in traffic within a short period of time, the rate limiter should reject the excess requests rather than pass the spike on to the downstream service.

Nonfunctional Requirements

Scale requirements:

  • 100 million daily active users
  • Data retention is not needed

Other requirements:

  • Scalability: The rate limiter should handle a large number of requests. As the user base grows, the system should scale accordingly.
  • Reliability: It should accurately track and limit requests without errors.
  • Availability: The rate limiter should be highly available; otherwise, it could become a single point of failure.
  • Consistency: It should ensure all nodes see the same request count for each client to prevent exceeding the limit.
  • Latency: The overhead added by the rate limiter should be minimal, to not affect the system's performance.
  • Efficiency: It should have optimized resource usage to handle a high volume of requests.
  • Privacy: No specific privacy requirements for rate limiting, as it doesn't require storing sensitive data.

Resource Estimation

Assumptions

  • Each user makes 1,000 requests per day.
  • Each request and response are 1 KB in size.
  • We need to store each user's remaining capacity and the timestamp of their last request. Let's say that takes 50 bytes.

Traffic Estimates: For 100 million daily active users, each making 1,000 requests, that is 100 billion requests per day. Assuming an even distribution over the roughly 86,400 seconds in a day, this works out to approximately 1.16 million requests per second.

Storage Estimates: Not applicable, since long-term data retention is not required.

Bandwidth Estimates: With each request and response around 1 KB in size and roughly 1.16 million requests per second, the rate limiter handles approximately 1.16 GB per second of incoming traffic and a similar amount outgoing, or on the order of 100 TB per day in each direction.

Memory Estimates: We need to keep each active client's rate-limiting state in memory. At roughly 50 bytes per client (remaining capacity plus last-request timestamp) and 100 million daily active users, that is about 5 GB, which fits comfortably in an in-memory store, sharded across a few nodes if needed.
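These figures can be double-checked with a quick back-of-envelope script that simply restates the assumptions above:

import math

DAU = 100_000_000            # daily active users
REQUESTS_PER_USER = 1_000    # requests per user per day
REQUEST_SIZE_BYTES = 1_000   # ~1 KB per request/response
STATE_PER_USER_BYTES = 50    # remaining capacity + last-request timestamp
SECONDS_PER_DAY = 86_400

requests_per_day = DAU * REQUESTS_PER_USER
print(f"Requests per second: {requests_per_day / SECONDS_PER_DAY:,.0f}")        # ~1.16 million
print(f"Incoming GB per second: {requests_per_day * REQUEST_SIZE_BYTES / SECONDS_PER_DAY / 1e9:.2f}")
print(f"Rate-limit state in GB: {DAU * STATE_PER_USER_BYTES / 1e9:.1f}")         # ~5 GB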

API Design

Our system can expose the following REST APIs:

POST /shouldAllowRequest

Request body:

{ "clientId": "123", "timestamp": "2023-07-13T07:20:50.52Z" }

Response:

{ "allowed": true }

or

{ "allowed": false }

High-Level Design

Design Rate Limiter

  1. Incoming Request: The data flow starts when the client (e.g., a user's web browser) makes a request to the service. This request will include some form of client identifier, such as an IP address or user ID, so the rate limiter can track requests on a per-client basis.

  2. Load Balancer: The request first hits the Load Balancer, which routes it to one of the Rate Limiter nodes. If we use sticky sessions based on the client identifier, all requests from the same client will be routed to the same Rate Limiter node.

  3. Rate Limiter Node and In-memory Data Store: The Rate Limiter node receives the request and looks up the current request count for this client in its in-memory data store (e.g., Redis). If the client is not present (as on the client's first request), the count is initialized to 1; otherwise, it is incremented. The Rate Limiter then checks whether the incremented count exceeds the rate limit.

  4. Forward to Service: If the count does not exceed the rate limit, the Rate Limiter forwards the request to the downstream service. If it does exceed the limit, the Rate Limiter returns a 429 "Too Many Requests" response to the load balancer, which returns it to the client.

Finally, the service processes the request and generates a response, which is sent back to the Rate Limiter. The Rate Limiter node then forwards the service's response back to load balancer and to the client.
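To make this flow concrete, here is a minimal sketch of the decision logic inside a rate limiter node; allow_request and forward_to_service are placeholders for the rate-check logic developed later in this article and for the downstream call, respectively:

HTTP_OK = 200
HTTP_TOO_MANY_REQUESTS = 429

def allow_request(client_id):
    # Placeholder: the real rate-check logic is developed in the
    # Detailed Design section below.
    return True

def forward_to_service(request):
    # Placeholder for forwarding the request to the downstream service.
    return {"status": HTTP_OK, "body": "ok"}

def handle_request(client_id, request):
    # Step 3: consult the per-client budget in the in-memory store.
    if not allow_request(client_id):
        # Over the limit: reject with 429 Too Many Requests.
        return {"status": HTTP_TOO_MANY_REQUESTS, "body": "Too Many Requests"}
    # Within the limit: forward downstream and relay the response.
    return forward_to_service(request)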

Detailed Design

Which Rate Limiting Algorithm Should We Use?

The choice of a rate-limiting algorithm often depends on your specific use case and system requirements, including factors such as how strictly you need to enforce the rate limits and how much complexity you're willing to manage. Here are a few common rate-limiting algorithms and their trade-offs:

Fixed Window Counter

This simple algorithm allows a fixed number of requests in each time window, such as 1000 requests per hour. The problem with the fixed window is that it does not prevent traffic spikes. For example, if a user makes all 1000 requests at the end of one hour and another 1000 at the start of the next hour, the system experiences a spike of 2000 requests. Here's the sample implementation for Fixed Window Counter:

import time

WINDOW_SIZE = 3600  # 1 hour in seconds
RATE_LIMIT = 1000

counter = 0
window_start_time = time.time()

def allow_request():
    global counter, window_start_time
    current_time = time.time()
    # Reset the counter if we're in a new time window
    if current_time > window_start_time + WINDOW_SIZE:
        counter = 0
        window_start_time = current_time
    # Check if the request can be allowed
    if counter < RATE_LIMIT:
        counter += 1
        return True
    else:
        return False

Sliding Window Log

This algorithm keeps a log of all the requests from the past window and checks the number of requests before allowing a new one. This eliminates the problem of traffic spikes seen in the fixed window counter but requires more storage and computation since you're maintaining a log of requests. Here's a sample implementation of Sliding Window Log:

import time

WINDOW_SIZE = 3600  # 1 hour in seconds
RATE_LIMIT = 1000

log = []  # Timestamps of requests within the current window

def allow_request():
    current_time = time.time()
    # Remove outdated requests from the log
    while log and current_time - log[0] > WINDOW_SIZE:
        log.pop(0)
    # Check if the request can be allowed
    if len(log) < RATE_LIMIT:
        log.append(current_time)
        return True
    else:
        return False

Token Bucket

This algorithm works like an actual bucket holding tokens. The bucket has a certain capacity, and tokens are added to the bucket at a fixed rate up to the maximum capacity. When a request arrives, the rate limiter tries to remove a token from the bucket. If a token is available (i.e., if the bucket is not empty), the request is allowed; otherwise, the request is rejected. The key characteristic of the Token Bucket algorithm is that it allows for burstiness as long as the average rate of requests doesn't exceed the token refill rate. In other words, it can handle a sudden influx of requests by using the tokens stored in the bucket. Here's a sample implementation of the Token Bucket algorithm:

import time

TOKEN_RATE = 1      # 1 token added per second
BUCKET_SIZE = 1000  # Maximum number of tokens the bucket can hold

tokens = BUCKET_SIZE          # Start with a full bucket
last_refill_time = time.time()

def allow_request():
    global tokens, last_refill_time
    current_time = time.time()
    # Refill tokens based on the time passed
    tokens += (current_time - last_refill_time) * TOKEN_RATE
    tokens = min(tokens, BUCKET_SIZE)
    last_refill_time = current_time
    # Check if the request can be allowed
    if tokens >= 1:
        tokens -= 1
        return True
    else:
        return False

Leaky Bucket

This algorithm also uses a bucket analogy, but in this case, the bucket leaks. Requests fill the bucket, and they leak out at a fixed rate. If the incoming requests arrive at a faster rate than the leak rate and the bucket fills up, then additional requests are rejected. The Leaky Bucket algorithm is designed to smooth out bursty traffic; it enforces a constant output rate regardless of the size of bursts. Here's a sample implementation of the Leaky Bucket algorithm:

import time

BUCKET_SIZE = 1000  # Bucket capacity
LEAK_RATE = 1       # 1 request leaks out per second

remaining_capacity = BUCKET_SIZE
last_leak_time = time.time()

def allow_request():
    global remaining_capacity, last_leak_time
    current_time = time.time()
    # Leak requests based on the time passed, freeing up capacity
    remaining_capacity += (current_time - last_leak_time) * LEAK_RATE
    remaining_capacity = min(remaining_capacity, BUCKET_SIZE)
    last_leak_time = current_time
    # Check if the request can be allowed
    if remaining_capacity >= 1:
        remaining_capacity -= 1
        return True
    else:
        return False

Here's a table comparing the four rate limiting algorithms:

Algorithm | Description | Pros | Cons
Fixed Window | Divides time into fixed-size windows and allows a certain number of requests in each window. | Simple and easy to implement. | Allows bursty traffic at the edge of windows, which could overload the system.
Sliding Window Log | Keeps a log of all requests in the current window and previous windows. | Smooths out bursty traffic better than the Fixed Window algorithm. | Requires more storage and computational resources because it keeps a log of all requests.
Token Bucket | Each user (or client IP, etc.) has a "bucket" of tokens, and each request costs a token. Tokens are added to the bucket at a fixed rate. | Allows for occasional bursts of traffic. | May be more complex to implement, especially in a distributed system.
Leaky Bucket | Similar to Token Bucket, but instead of adding tokens, requests "leak" out of the bucket at a fixed rate. If the bucket is full, incoming requests are rejected. | Smooths out traffic effectively, preventing bursts. | Can lead to high rejection rates if the traffic is naturally bursty.

The best algorithm to use depends on the specifics of the system and the nature of the traffic. In general, Fixed Window is a good choice for simple systems or for getting started quickly. Sliding Window Log is a good choice for systems that need to smooth out bursty traffic but can handle the additional resource requirements. Token Bucket and Leaky Bucket are good choices for systems that need more control over the traffic pattern.

Considering the functional requirement of smoothing out traffic surges and the non-functional requirement of efficiency, we choose the Leaky Bucket algorithm for our design.

Data Store

Database Type

Given the nature of the data we are dealing with (mostly key-value pairs for users and their request counts), and the requirement for high speed reads and writes, a NoSQL database would be a good fit. Specifically, an in-memory data store like Redis could be used due to its high performance and features such as atomic operations and TTL (time-to-live) on keys, which can be handy for implementing the rate limiter.

Database Partitioning

Given the massive scale of the system (100 million daily active users), we would need to partition or shard our data to distribute it across multiple databases or servers for improved performance and scalability. There are several partitioning methods, but for our use case, a simple partitioning scheme such as consistent hashing could be used. Consistent hashing would distribute our users evenly across the database nodes and minimize data movement when nodes are added or removed.
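As an illustration of how keys can be mapped to nodes, here is a minimal consistent hashing ring in Python; the node names, virtual-node count, and use of MD5 are illustrative assumptions:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=None, virtual_nodes=100):
        # More virtual nodes per physical node gives a more even spread.
        self.virtual_nodes = virtual_nodes
        self._ring = []              # Sorted hash positions on the ring
        self._position_to_node = {}  # Hash position -> physical node
        for node in (nodes or []):
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.virtual_nodes):
            pos = self._hash(f"{node}#{i}")
            bisect.insort(self._ring, pos)
            self._position_to_node[pos] = node

    def remove_node(self, node):
        for i in range(self.virtual_nodes):
            pos = self._hash(f"{node}#{i}")
            self._ring.remove(pos)
            del self._position_to_node[pos]

    def get_node(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        pos = self._hash(key)
        idx = bisect.bisect(self._ring, pos) % len(self._ring)
        return self._position_to_node[self._ring[idx]]

# Example: route a client's rate-limit state to one of three Redis shards.
ring = ConsistentHashRing(["redis-1", "redis-2", "redis-3"])
print(ring.get_node("client:123"))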

Database Replication

Since data retention is not a requirement and our rate limiting does not need to be perfectly strict or fair, we can afford to lose some data if a node goes down. Without the need for long-term storage, the cost and complexity of replication are hard to justify.

In this design, each node can independently rate limit requests based on its own data. If a node fails, the clients it was serving are redistributed among the remaining nodes, and their request counts are effectively reset because the new node has no knowledge of their previous requests. Depending on the rate limits and the frequency of node failures, this may occasionally let some clients exceed their limits, but that is acceptable for our requirements.

Data Retention and Cleanup

In the context of rate limiting, data retention isn't typically a major concern as we only need to track the user requests for a short period (the duration of the rate limit window). We do not need to keep this data once it's no longer relevant. Redis provides a feature called TTL (time-to-live), which automatically removes keys after a specified duration. We can set the TTL of each user key to the window size of our rate limiter. This way, the data would automatically be cleaned up by Redis once it's no longer relevant. This not only saves space but also simplifies the system as we do not have to manage the cleanup process ourselves.
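As a small illustration of TTL-based cleanup, a per-client counter key can be created and expired like this with redis-py; the key naming and window size are assumptions, and for the leaky bucket state used later we could similarly call EXPIRE on each user's hash after updating it:

import redis

r = redis.Redis()
WINDOW_SIZE = 3600  # Rate limit window in seconds (1 hour, as assumed above)

def record_request(client_id):
    key = f"ratelimit:{client_id}"
    # INCR creates the key with value 1 if it does not exist yet.
    count = r.incr(key)
    if count == 1:
        # Start the TTL clock on the first request of the window; Redis
        # deletes the key automatically once the window has passed.
        r.expire(key, WINDOW_SIZE)
    return count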

Concurrency

Concurrency is a significant concern when designing a system like a rate limiter. It should accurately track and limit requests without errors.

Let's first understand what the problem could be. Consider a scenario where a user sends two requests at the same time, and these requests are processed concurrently by the server. Both requests fetch the user's current request count (let's say it's 9 and the limit is 10), increment it (both get a new count of 10), and then save it back to the database. The user's final request count in the database is 10, but it should be 11. Hence, the user has effectively exceeded their rate limit without the system noticing.

Here's how we can address this concurrency issue:

Atomic Operations: In Redis, you can use multi/exec commands to ensure that all commands in the block are executed sequentially without any interference from other clients, which effectively solves the concurrency problem. Similarly, most databases support transactions that allow you to perform multiple operations as a single, atomic operation. Here's an example of how you could use Redis' multi/exec commands:

MULTI
INCR user:request_count
EXPIRE user:request_count 60
EXEC

In this example, the INCR and EXPIRE commands are executed as a single, atomic unit: Redis runs the whole MULTI/EXEC block without interleaving commands from other clients, so the request count is updated correctly even under concurrent requests.

Distributed Locks: In a distributed system where multiple nodes might update a user's request count, a distributed lock can ensure that only one node updates the count at a time. This is a more complex solution and can hurt performance due to the time spent acquiring and releasing locks, so it is generally used only when necessary.

Sharding: Routing all requests from the same user to the same server (also known as sticky sessions) also helps with concurrency. If all of a user's requests go to the same server, that server can use local locks or single-threaded operations to ensure the request count is updated correctly.
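In redis-py, the same atomic block can be expressed with a transactional pipeline. A minimal sketch, assuming a 60-second window and a key name of our choosing:

import redis

r = redis.Redis()

def count_request(user_id):
    key = f"user:{user_id}:request_count"
    # transaction=True wraps the queued commands in MULTI/EXEC, so the
    # increment and the expiry update are applied atomically.
    pipe = r.pipeline(transaction=True)
    pipe.incr(key)
    pipe.expire(key, 60)
    count, _ = pipe.execute()
    return count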

Analytics

Similar to the URL shortener problem, we can add an analytics service and decouple the rate limiting from the analytics using a message queue, ensuring that any potential issues with the analytics service won't impact the performance of the rate limiter. The rate limiter can quickly push the data about a dropped request to the queue and continue processing new requests.

Here's how we can modify our system to incorporate this design:

Design Rate Limiter with Analytics

Rate Limiter Node: Once a request exceeds the rate limit and is dropped, the rate limiter node sends a message to a message queue. The message contains information about the request, such as the client identifier, timestamp, and possibly the request details.

Message Queue: A scalable, distributed message queue such as Kafka or RabbitMQ can be used. The queue temporarily stores the messages until they are consumed by the analytics service. If a high volume of messages is expected, the queue can be set up with multiple partitions to distribute the messages and enable parallel processing.

Analytics Service: The analytics service reads the messages from the queue, processes them, and then stores the data for further analysis. The service can be designed to handle a batch of messages at a time, which is more efficient than processing one message at a time.
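As a rough sketch of the producer side, the rate limiter node could publish a dropped-request event like this; the kafka-python client, broker address, and topic name are assumptions, and any durable message queue would work similarly:

import json
import time

from kafka import KafkaProducer  # kafka-python, assumed as the client library

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def report_dropped_request(client_id):
    # Fire-and-forget publish so the rate limiter keeps serving traffic;
    # the analytics service consumes these events asynchronously.
    producer.send("rate-limiter.dropped-requests", {
        "clientId": client_id,
        "timestamp": time.time(),
    })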

Sample Implementation

import time
import redis

r = redis.Redis()

BUCKET_SIZE = 1000  # Maximum capacity of the bucket
LEAK_RATE = 1       # Leak rate in requests per second

def allow_request(user_id):
    current_time = time.time()

    # Fetch the stored data for the user
    remaining_capacity = r.hget(user_id, 'remaining_capacity')
    last_request_time = r.hget(user_id, 'last_request_time')

    # If this is the user's first request, initialize their data
    if remaining_capacity is None or last_request_time is None:
        remaining_capacity = BUCKET_SIZE
        last_request_time = current_time
    else:
        remaining_capacity = float(remaining_capacity)
        last_request_time = float(last_request_time)

    # Leak requests based on the time passed, freeing up capacity
    time_passed = current_time - last_request_time
    leaked_requests = time_passed * LEAK_RATE
    remaining_capacity = min(remaining_capacity + leaked_requests, BUCKET_SIZE)

    # Check if the request can be allowed
    if remaining_capacity >= 1:
        # If so, decrement the remaining capacity and update the last request time
        r.hset(user_id, mapping={
            'remaining_capacity': remaining_capacity - 1,
            'last_request_time': current_time
        })
        return True
    else:
        # If not, persist the leaked capacity and the new last request time,
        # but do not admit the request
        r.hset(user_id, mapping={
            'remaining_capacity': remaining_capacity,
            'last_request_time': current_time
        })
        return False

The allow_request function checks if a request from a user should be allowed based on the Leaky Bucket algorithm. The Redis HGET command is used to retrieve the remaining capacity and the time of the last request for the user. If the user doesn't have any stored data (which would be the case for their first request), these values are initialized to the bucket size and the current time, respectively.

The function then calculates the number of requests that should have leaked from the bucket based on the time passed since the last request. It updates the remaining capacity accordingly, ensuring it doesn't exceed the bucket size.

Finally, the function checks if the updated remaining capacity is sufficient to allow the current request. If so, it decrements the remaining capacity by 1; if not, it leaves the capacity unchanged and rejects the request. In both cases, it writes the updated capacity and the time of the last request back to Redis using the HSET command.
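A quick way to exercise the function, assuming a Redis server is running locally:

if __name__ == "__main__":
    # Simulate a burst from a single user; with BUCKET_SIZE = 1000, roughly
    # the first 1,000 requests should be allowed and the rest rejected.
    allowed = sum(allow_request("user:42") for _ in range(1500))
    print(f"Allowed {allowed} of 1500 burst requests")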
