Solution

Yelp System Design

Design an online platform that supplies geo-based search and publishes crowd-sourced reviews about businesses similar to Yelp. Yelp.com is a popular platform for users to find and review new places to eat, shop, drink, relax, and play.

Search in Yelp

Functional Requirements

  • Business Information: Develop a comprehensive business profile feature that allows users to view essential details such as location, contact information, operating hours, services offered, and photos.
  • Search: Create a robust search engine that enables users to find businesses based on various filters like location, category, ratings, and specific services or amenities.
  • Reviews and Ratings: Design an interactive review system where users can write detailed reviews, rate businesses, and even update their reviews. The system should prevent spam and abuse.

Non-Functional Requirements

  • Scalability: The system should be able to scale horizontally to accommodate increasing loads and spikes in user activity without degradation in performance.
  • High Availability: Implement redundancy and failover strategies to ensure that the system remains operational even in the event of component failures.
  • Low Latency: Optimize database queries, use efficient caching mechanisms, and minimize network latency to ensure a fast user experience.
  • High Durability: Employ replication and backup strategies to ensure that data is not lost and can be recovered in case of a system failure.

Resource Estimation

Given that the DAU (Daily Active Users) is 1M, and each user performs 1 write operation every 10 days, the daily number of write requests is: 1M / 10 = 100K.

Additionally, with a Read:Write ratio of 100:1, the daily number of read requests is: 1M / 10 * 1000 = 100M. Therefore, QPS = 100M / (24 * 3600) ≈ 1157

Given that each review is 1KB in size, the daily increase in data volume is: 1M / 10 * 1KB. To retain data for 5 years, the required storage space is: 1M / 10 * 1KB * 365 * 5 ≈ 177GB

Yelp Resource Estimation

API Endpoint Design

  • GET /search?lon={longitude}&lat={latitude}&radius={radius}&keyword={keyword} searches for businesses within a circular area defined by the given longitude and latitude as the center and the specified radius, containing the keyword. The response body as bellow:
{ "status": 200, // number "business": [...] // business ids array }
  • POST /reviews: Enables users to submit a review for a business, which includes a rating and text content. The request body as bellow:
{ "business_id": "The id of the business", "rating": 4.6, // number "content": "The content of" }
  • GET /business/{business_id}: Retrieves detailed information about a business, including user reviews, ratings, and photos. The response body as bellow:
{ "business_id": "The id of the business", "latitude": 37.4219407, // double "longitude": -122.0019211, // double "name": "The business name", "description": "The description of the business", "street": "123 Main St", "city": "Metropolis", "state": "NY", "zip_code": "12345" "rating": 4.6 // number }
  • GET /business/{business_id}/reviews?page={page_number}&limit={limit_per_page}&last_timestamp={last_page_timestamp}: Retrieves a paginated list of reviews for a specific business.

The updated response JSON structure with pagination might look like this:

{ "status": 200, "current_page": 1, "per_page": 10, "total_pages": 5, "total_reviews": 50, "reviews": [ { "review_id": "The unique id of the review", "user_id": "The unique id of the user who wrote the review", "rating": 4.5, "content": "The written content of the review", "timestamp": "The date and time when the review was posted" }, { "review_id": "Another unique id of a review", "user_id": "The unique id of another user", "rating": 3.8, "content": "The written content of another review", "timestamp": "The date and time when this review was posted" } // ... more reviews ] }

In this design, current_page indicates the page number of the current set of results, per_page is the number of reviews returned per page, total_pages is the total number of pages available, and total_reviews is the total number of reviews for the business.

High-Level Design

The system architecture will include a set of microservices, each responsible for a specific domain: Business Service for managing business data, Review Service for reviews and ratings, and Search Service for processing search queries. These services will communicate via RESTful APIs or message queues for asynchronous tasks. A Load Balancer will distribute incoming requests to appropriate service instances, which can scale out based on demand.

Alt text

The number of locations is vast, but the frequency of updates is low. We will store the data in a cache, and when processing requests, we will read from the cache to speed up the system's response.

In the diagram above, we can use Change Data Capture (CDC) to import data from the Business Database into Elasticsearch, which supports the Search Service for full-text searches. This detailed integration process, focusing on using CDC with Elasticsearch for improved search capabilities, is beyond this document's scope but can be further explored in Database → Change Data Capture (CDC) → Elasticsearch: Unlocking Search, Analytics, and Real-Time Responsiveness.

Detailed Design

Data Store

Database Type

  • User and Review Data: In light of the Resource Estimation section, the system is expected to handle about 100K write operations daily, primarily from user-submitted reviews. Assuming an average size of 1KB per review, this translates to a data growth of roughly 100MB per day. This volume is within the manageable range for both modern SQL and NoSQL database systems. Moreover, read requests primarily access the cache, minimizing the impact on the database.

    On the other hand, the need for consistency in user and review data hinges on the specific business logic of the application. For review systems, eventual consistency is often deemed acceptable, as the immediate visibility of user-submitted reviews across all nodes is not typically critical.

    Considering both the data volume and the need for consistency, both SQL and NoSQL databases can meet the system's requirements.

  • Business(locations) Data: Employ a NoSQL database like Cassandra or MongoDB for business data due to its scalability and flexibility to handle unstructured data like photos and varied business attributes.

Search Engine

Elasticsearch will serve as the primary search engine due to its powerful full-text search capabilities, geo-spatial search features, and scalability to handle large datasets with quick response times.

Tools like Elasticsearch and Redis already support geo-based searches, so using these tools can fulfill the business requirements. However, such an answer might not be what the interviewer wants to hear. Therefore, the Geo-spatial Search and QuadTrees Implementation section introduces how to implement geo-based search without using tools like Elasticsearch and Redis.

Data Schema

Relational Database Schema (e.g., PostgreSQL)

Users Table

CREATE TABLE Users ( user_id SERIAL PRIMARY KEY, username VARCHAR(255) NOT NULL, email VARCHAR(255) UNIQUE NOT NULL, password_hash VARCHAR(255) NOT NULL, join_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, last_login TIMESTAMP, profile_picture_url VARCHAR(255), -- Additional fields like bio, location, etc. );

Businesses Table

CREATE TABLE Businesses ( business_id SERIAL PRIMARY KEY, owner_id INT REFERENCES Users(user_id), name VARCHAR(255) NOT NULL, description TEXT, city VARCHAR(255) NOT NULL, state VARCHAR(255) NOT NULL, zip_code VARCHAR(10) NOT NULL, latitude FLOAT NOT NULL, longitude FLOAT NOT NULL, category VARCHAR(255) NOT NULL, -- Additional fields for services, amenities, etc. );

Reviews Table

CREATE TABLE Reviews ( review_id SERIAL PRIMARY KEY, business_id INT REFERENCES Businesses(business_id), user_id INT REFERENCES Users(user_id), rating DECIMAL(2, 1) CHECK (rating >= 0 AND rating <= 5), content TEXT, review_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, -- Additional fields like photos, response from business, etc. );

Ratings Table (Optional)

-- This table can be used if you want to separate ratings from reviews CREATE TABLE Ratings ( rating_id SERIAL PRIMARY KEY, business_id INT REFERENCES Businesses(business_id), user_id INT REFERENCES Users(user_id), rating DECIMAL(2, 1) CHECK (rating >= 0 AND rating <= 5), rating_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP );

NoSQL Database Schema (e.g., MongoDB)

For the Business data, a NoSQL document-based schema can be used. Below is an example of how a business document might look in MongoDB:

Business Document

{ "_id": ObjectId("unique_business_id"), "owner_id": "owner_user_id", "name": "Business Name", "description": "Business Description", "address": { "street": "123 Main St", "city": "Metropolis", "state": "NY", "zip_code": "12345" }, "geo": { "type": "Point", "coordinates": [-122.0019211, 37.4219407] // [longitude, latitude] }, "category": "Food", "services": ["Delivery", "Takeout"] }

Database Partitioning (Sharding)

Business Data: Since business data can be vast, we can shard it based on geographic location, such as city or state. This approach ensures that queries for businesses are fast because they are likely to be localized and can be directed to the shard containing data for that specific location.

User Data: For sharding user data, a straightforward approach based on the user_id can be employed. By dividing the user_id space into ranges or using a hash function to partition user_ids across different shards, we can ensure an even distribution of data. This method facilitates load balancing and supports horizontal scaling by distributing the workload evenly across multiple database instances. Each shard can operate somewhat independently, which enhances the performance and scalability of the system. This strategy is particularly effective for user data, as it allows for efficient queries and operations that are localized to specific shards, thereby reducing cross-shard communication and improving overall system performance.

Review Data: Reviews can be sharded based on the business_id, ensuring that all reviews for a particular business are stored on the same shard. This collocation optimizes the performance of queries that fetch reviews for a business.

For more detailed content, refer to Partitioning (Sharding).

Cache

Cache Granularity

Object Caching: Cache business objects, user profiles, and reviews as whole objects. This approach simplifies the caching logic but may consume more memory.

Query Result Caching: Cache the results of common queries, such as business searches within a certain radius. This is highly effective for read-heavy operations.

Distributed Caching

Use a distributed cache like Redis or Memcached, which can scale out by adding more nodes to the cache cluster. This allows the cache to handle large volumes of data and a high number of concurrent users.

Cache Invalidation

Implement a cache invalidation strategy to ensure that stale data is removed from the cache. This can be done using time-to-live (TTL) values or by explicitly invalidating cache entries when the underlying data changes.

For additional information on caching, please refer to our article on Caching.

Geo-spatial Search and QuadTrees Implementation

QuadTrees are a type of data structure that is particularly well-suited for spatial indexing, which is the process of organizing data according to their geographic locations. This makes them an excellent choice for implementing location-based search in a system like Yelp, where users need to find businesses near a certain location. Unlike a naive approach that might involve dividing the space into uniform squares for search purposes, QuadTrees offer a dynamic and efficient solution. They adapt to the distribution of data points by subdividing the space into smaller regions only where the density of points is high. This adaptability reduces space wastage and speeds up the search process by quickly eliminating irrelevant areas.

However, it's important to note that in real-world applications, Geo-hashing is often preferred for its simplicity and efficiency. Geo-hashing converts geographic locations into a short string of characters and numbers, facilitating fast and accurate proximity searches. It simplifies handling spatial data in distributed systems due to its straightforward string-based representation. While QuadTrees provide a powerful method for spatial indexing with their dynamic subdivision of space, Geo-hashing is frequently used for its scalability and ease of integration into existing systems.

By comparing QuadTrees and Geo-hashing, we see that each has its advantages. QuadTrees excel in scenarios where the spatial distribution of data points is highly uneven, allowing for more efficient data organization and search. On the other hand, Geo-hashing offers a more uniform approach to spatial indexing, making it particularly useful for distributed systems where simplicity and speed are paramount. Depending on the specific requirements and constraints of the system being designed, either approach—or even a combination of both—could be employed to achieve efficient location-based search functionality.

In this document, we focus on implementing QuadTrees to demonstrate a dynamic and efficient method for spatial indexing.

Alt text

Basic Concept

A QuadTree is a tree data structure in which each internal node has exactly four children. It is used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions. The regions may represent different levels of detail at different levels of the tree.

Alt text

How QuadTrees Work

  1. Initialization: The entire map is considered a square and represented as the root of the QuadTree. This square is then divided into four equal-sized quadrants.

  2. Insertion: Each business location is inserted into the QuadTree. The insertion starts from the root node. If a node represents a region that should be subdivided (because it contains more than one point or exceeds a certain density), that node is split into four child quadrants, and the points it contains are distributed among these children. This process continues recursively.

  3. Searching: To find all businesses within a certain radius of a point, the search begins at the root node and recursively visits nodes that intersect with the search area. If a node's region lies entirely within the search area, all businesses in that node are part of the result set. If a node's region does not intersect at all with the search area, it and its children are ignored.

Implementing QuadTrees for Yelp

  1. Data Structure: Define the QuadTree node structure with properties for bounds (representing the spatial region), points (businesses within the region), and children (four child nodes).
class QuadTreeNode: def __init__(self, bounds): self.bounds = bounds # The spatial region this node covers self.points = [] # Businesses within this region self.children = [] # Four children, one for each quadrant
  1. Insert Businesses: When a new business is added, insert it into the QuadTree based on its geographic coordinates. If a node exceeds its capacity, subdivide it and redistribute the businesses among the new child nodes.
def insert(self, business): if self.should_subdivide(): self.subdivide() # Determine which child quadrant the business belongs to and insert it there. # If the node is a leaf and has not reached capacity, add the business to the points list.
  1. Search Query: When a user searches for businesses within a certain radius of their location, perform a range search on the QuadTree. This will efficiently return all businesses within the search area.
def search(self, center, radius): results = [] # Check if the search area intersects with the node's bounds. if self.bounds.intersect(center, radius): # Add businesses within the search area to the results. results.extend([business for business in self.points if business.is_within(center, radius)]) # Recursively search child nodes. for child in self.children: results.extend(child.search(center, radius)) return results
  1. Optimization: To optimize the search, consider adding additional spatial indexes or heuristics that can skip unnecessary branches of the tree, further reducing the search space.

Challenges and Considerations

  • Dynamic Updates: QuadTrees need to handle updates efficiently, such as when businesses close or change locations. This may require removing and re-inserting points, which can be computationally expensive.

  • Load Balancing: In a real-world scenario, business locations are not evenly distributed. Some areas may have a high density of businesses, leading to deep QuadTree nodes. Balancing the tree to prevent very deep or unbalanced trees is important for performance.

  • Memory Usage: Each node in the QuadTree stores a list of points and child nodes, which can lead to high memory usage. Pruning and garbage collection strategies may be necessary to manage memory.

  • Concurrency: In a high-traffic system like Yelp, multiple read and write operations may occur simultaneously. Implementing concurrency control, such as read-write locks, can ensure data consistency.

By using QuadTrees for spatial indexing in the Yelp-like system, the platform can provide efficient and fast location-based searches.

Follow up Detailed Design Questions and Answers

How can we implement location-based search to allow users to find businesses near their current location?

In the "Geo-spatial Search and QuadTrees Implementation" section, we explored the algorithmic foundation of implementing geo-spatial searches, detailing how QuadTrees and Geo-hashing can be utilized for efficient spatial indexing and location-based searches. Building on that theoretical groundwork, this section focuses on established third-party tools that offer robust support for geo-spatial searches, leveraging the principles of QuadTrees, Geo-hashing, or similar spatial data structures to provide scalable and efficient location-based search capabilities.

Implementing Location-Based Search with Third-Party Tools

Grasping the building blocks ("the lego pieces")

This part of the guide will focus on the various components that are often used to construct a system (the building blocks), and the design templates that provide a framework for structuring these blocks.

Core Building blocks

At the bare minimum you should know the core building blocks of system design

  • Scaling stateless services with load balancing
  • Scaling database reads with replication and caching
  • Scaling database writes with partition (aka sharding)
  • Scaling data flow with message queues

System Design Template

With these building blocks, you will be able to apply our template to solve many system design problems. We will dive into the details in the Design Template section. Here’s a sneak peak:

System Design Template

Additional Building Blocks

Additionally, you will want to understand these concepts

  • Processing large amount of data (aka “big data”) with batch and stream processing
    • Particularly useful for solving data-intensive problems such as designing an analytics app
  • Achieving consistency across services using distribution transaction or event sourcing
    • Particularly useful for solving problems that require strict transactions such as designing financial apps
  • Full text search: full-text index
  • Storing data for the long term: data warehousing

On top of these, there are ad hoc knowledge you would want to know tailored to certain problems. For example, geohashing for designing location-based services like Yelp or Uber, operational transform to solve problems like designing Google Doc. You can learn these these on a case-by-case basis. System design interviews are supposed to test your general design skills and not specific knowledge.

Working through problems and building solutions using the building blocks

Finally, we have a series of practical problems for you to work through. You can find the problem in /problems. This hands-on practice will not only help you apply the principles learned but will also enhance your understanding of how to use the building blocks to construct effective solutions. The list of questions grow. We are actively adding more questions to the list.

Read the rest of this article and practice this problem with a FREE account