Solution

Design Twitter

We aim to design a simplified version of Twitter, a popular social media platform where users can post tweets, follow or unfollow other users, and view the tweets of the people they follow. The platform also includes a recommendation algorithm that suggests content to users based on their preferences and interactions.

Functional Requirements

  1. Tweeting: Users should be able to write and post new tweets.
  2. Follow/Unfollow: Users should have the ability to follow or unfollow other users.
  3. Timeline: Users should be able to view a list of tweets from the people they follow, as well as content recommended by the recommendation algorithm.

Non-Functional Requirements

  • 300M DAUs
  • Each tweet is approximately 140 characters (or 280 bytes)
  • Retain data for five years.
  • Each user posts one tweet per day on average.
  • High availability
  • Low latency
  • High durability
  • Security

Resource Estimation

We assume a read-to-write ratio of 100:1.

Using the resource estimator, we get the following results:

[Figure: Resource Estimation]
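The estimator's output is not reproduced here, so the following is only a rough sketch of the arithmetic implied by the stated requirements (300M DAU, one 280-byte tweet per user per day, five years of retention, 100:1 reads to writes). It counts tweet text only and ignores metadata, media, indexes, and replication, so real numbers would be higher. The rates are daily averages, not peaks.

```python
# Back-of-the-envelope estimate derived from the stated requirements.
DAU = 300_000_000             # daily active users
TWEETS_PER_USER_PER_DAY = 1
TWEET_SIZE_BYTES = 280        # tweet text only (assumption: no metadata or media)
RETENTION_YEARS = 5
READ_WRITE_RATIO = 100
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = DAU * TWEETS_PER_USER_PER_DAY
write_qps = tweets_per_day / SECONDS_PER_DAY            # average writes per second
read_qps = write_qps * READ_WRITE_RATIO                 # average reads per second
storage_per_day_gb = tweets_per_day * TWEET_SIZE_BYTES / 1e9
storage_total_tb = storage_per_day_gb * 365 * RETENTION_YEARS / 1e3

print(f"average writes/s: {write_qps:,.0f}")
print(f"average reads/s:  {read_qps:,.0f}")
print(f"storage per day:  {storage_per_day_gb:,.0f} GB")
print(f"storage, 5 years: {storage_total_tb:,.1f} TB")
```

Under these assumptions the system sees roughly 3,500 writes and 350,000 reads per second on average, and stores about 84 GB of tweet text per day, or about 153 TB over five years.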

API Endpoint Design

The API endpoints could include:

  • POST /tweets for creating a new tweet. Request body as below:
    { "content": "The content of the new tweet." }
  • GET /tweets/{userId}?last={timestamp}&size={size} for retrieving a user's tweets.
  • POST /follow/{userId} for following a user.
  • DELETE /follow/{userId} for unfollowing a user.
  • GET /timeline?last={timestamp} for retrieving timeline tweets.
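The `last` and `size` query parameters suggest cursor-based pagination. The sketch below assumes that `last` is the timestamp of the oldest tweet the client has already seen and that the server returns up to `size` tweets strictly older than it, newest first. The endpoint list does not spell out these semantics, so treat them, along with the `Tweet` type and `get_page` helper, as illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tweet:
    tweet_id: str
    user_id: str
    content: str
    timestamp: int  # epoch milliseconds


def get_page(tweets: List[Tweet], last: Optional[int], size: int) -> List[Tweet]:
    """Return up to `size` tweets strictly older than `last`, newest first.

    `last=None` means "start from the newest tweet".
    """
    newest_first = sorted(tweets, key=lambda t: t.timestamp, reverse=True)
    if last is not None:
        newest_first = [t for t in newest_first if t.timestamp < last]
    return newest_first[:size]


# Five tweets with increasing timestamps, fetched two at a time.
tweets = [Tweet(f"t{i}", "u1", f"tweet {i}", 1000 + i) for i in range(5)]
first_page = get_page(tweets, last=None, size=2)
second_page = get_page(tweets, last=first_page[-1].timestamp, size=2)
```

A timestamp-only cursor can skip tweets that share the same timestamp; a production cursor would likely add the tweet ID as a tie-breaker.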

High-Level Design

Twitter is the perfect example of designing a scalable system using our system design template.

The system could be designed with distinct components, specifically the Tweet Service, Follow Service, and Timeline Service. The design diagram is shown below:

[Figure: Twitter System Design Diagram]

The Tweet Service and Follow Service handle requests for posting tweets, following, and unfollowing. To respond to clients quickly and support high concurrency, these two services do not write to the database directly; instead, they write the request data to the Message Queue. The Database Updater component reads the request data from the Message Queue, executes the business logic, writes to the database, and updates the cache.
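The write path can be sketched in a single process as follows. This is a minimal illustration, not the actual implementation: `queue.Queue` stands in for a real message broker, plain dictionaries stand in for the database and cache, and the function names are hypothetical.

```python
import queue
import threading

database = {}                  # stand-in for the tweets table
cache = {}                     # stand-in for the cache: user_id -> recent tweet ids
message_queue = queue.Queue()  # stand-in for the message queue


def tweet_service(user_id, tweet_id, content):
    """Enqueue the request and return immediately; no database write happens here."""
    message_queue.put({"user_id": user_id, "tweet_id": tweet_id, "content": content})
    return {"status": "accepted", "tweet_id": tweet_id}


def database_updater():
    """Consume queued requests, write to the database, then update the cache."""
    while True:
        msg = message_queue.get()
        if msg is None:  # sentinel used only to stop this demo worker
            message_queue.task_done()
            break
        database[msg["tweet_id"]] = msg
        cache.setdefault(msg["user_id"], []).insert(0, msg["tweet_id"])
        message_queue.task_done()


worker = threading.Thread(target=database_updater, daemon=True)
worker.start()

response = tweet_service("alice", "t1", "hello")  # returns before the write is done
message_queue.put(None)
message_queue.join()    # wait until the updater has processed everything
worker.join(timeout=1)
```

The trade-off is that a tweet is acknowledged before it is durable in the database, so the queue itself must be durable, and clients may briefly not see their own write.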

The Timeline Service handles requests for loading timelines. For faster responses and better system efficiency, this service reads data from the cache rather than querying the database.

Fan-out-on-write

Each user has their own "inbox" in the cache that stores the tweets to be displayed in their timeline. When someone they follow posts a tweet, the tweet is delivered to that inbox. This is often called "fan-out-on-write" because it replicates ("fans out") a piece of information to multiple destinations at the time of its creation or update. The advantage is significantly better read performance: the tweets are already present in the timeline cache when a user logs in, which avoids complex and time-consuming queries at read time. However, for celebrities with millions of followers, this can be a problem because a single tweet triggers millions of writes. This is a common follow-up question in Design Twitter interviews, and we cover it in the follow-up questions section.
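A minimal in-memory sketch of fan-out-on-write follows. The inbox cap (`INBOX_SIZE`) and all function names are assumptions made for illustration; in the design above, the inboxes would live in the cache rather than in a Python dictionary.

```python
from collections import defaultdict, deque

INBOX_SIZE = 800  # hypothetical cap on tweets kept per inbox

followers = defaultdict(set)                             # author -> follower ids
inboxes = defaultdict(lambda: deque(maxlen=INBOX_SIZE))  # user -> newest-first tweets


def follow(follower, author):
    followers[author].add(follower)


def post_tweet(author, content):
    """Write path: copy the tweet into every follower's inbox."""
    for follower in followers[author]:
        inboxes[follower].appendleft((author, content))


def timeline(user, size=20):
    """Read path: a single lookup, with no joins or per-author queries."""
    return list(inboxes[user])[:size]


follow("bob", "alice")
follow("carol", "alice")
post_tweet("alice", "hello world")
```

The cost of `post_tweet` grows linearly with the number of followers, which is exactly what makes celebrity accounts expensive under this strategy.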

Detailed Design

Database Type

Considering the scale requirement of 300M DAU and assuming that each user sends one tweet per day, this would generate 300M tweets per day, which is a tremendous amount of data. At the same time, this system does not have complex query requirements. Considering these two points, NoSQL could be used as the database.

A NoSQL database like Cassandra could be used due to its ability to handle large amounts of data and its high write speed.

Data Schema

The data schema could include a Users table, a Tweets table, and a Follows table. The Users table would store user information, the Tweets table would store tweets, and the Follows table would store information about who each user is following.

Here is the table structure, designed with Cassandra as the database:

Users Table

Field         Type
UserID        UUID PRIMARY KEY
UserName      Text
UserEmail     Text
UserPassword  Text

Tweets Table

Field           Type
TweetID         UUID PRIMARY KEY
UserID          UUID
TweetContent    Text
TweetTimestamp  Timestamp

Follows Table

Field           Type
FollowerUserID  UUID
FollowedUserID  UUID
FollowID        UUID PRIMARY KEY

In this design, we use UUIDs as primary keys because they can be generated without a central coordinator, which is very useful in distributed systems.

In addition, we need to create additional tables to optimize our queries. For example, if we often need to query all tweets from a user, we might create a table whose primary key is composed of UserID (the partition key) and TweetTimestamp (a clustering column). This way, all of a user's tweets are stored together and sorted by time, so they can be fetched quickly.

Database Partitioning

The database could be partitioned based on user ID to distribute the data evenly and reduce the load on any single node.
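As a rough illustration of partitioning by user ID, the sketch below hashes the ID and takes it modulo a fixed partition count. The partition count and the `partition_for` helper are hypothetical. Cassandra itself places rows using consistent hashing over a token ring rather than a simple modulo, so this only shows the general idea.

```python
import hashlib

NUM_PARTITIONS = 8  # hypothetical number of partitions


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user ID to a partition using a stable hash.

    Python's built-in hash() is randomized per process for strings, so a
    cryptographic digest is used here purely for its stability.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Count how many of 8,000 synthetic user IDs land on each partition.
counts = [0] * NUM_PARTITIONS
for i in range(8000):
    counts[partition_for(f"user-{i}")] += 1
print(counts)
```

Hashing spreads users evenly, but it does not spread load evenly when one user is far more active or popular than the rest. Simple modulo placement also moves most keys when the partition count changes, which is one reason consistent hashing is preferred in practice.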

Partitioning by user ID can lead to the "celebrity problem": the partition that holds a very popular account receives a disproportionate share of the traffic. Resolving this requires a more complex partitioning strategy. It is discussed further in the Follow Up Detailed Design Questions and Answers section.

Database Replication

Distributed NoSQL databases usually have a certain degree of data redundancy and failover capabilities, and can automatically recover when nodes fail. Therefore, the main consideration for data backup should be to protect data against extreme situations such as physical damage to the data center, network attacks, or human errors.

For these extreme situations, here are some data backup strategies:

  1. Regular Backups: Perform full and incremental backups regularly. A full backup involves backing up all data, while an incremental backup involves backing up only the data that has changed since the last full or incremental backup.

  2. Offline Backups: Store backup data in a location separate from the production environment. This way, even if the production environment is attacked or damaged, the backup data remains safe.

  3. Geographical Distribution: Store backup data in different geographical locations. This way, even if a disaster occurs at one location (such as a fire or flood), the backup data at other locations can still be used.

  4. Test Backups: Regularly test the process of restoring backup data to ensure that data can be successfully restored when needed.

  5. Version Control: Keep multiple versions of backups. This way, if the latest backup has a problem, you can still use an older backup.

Data Retention and Cleanup

Tweets that are older than five years could be archived or deleted to free up storage space.

Cache

Caching is an essential aspect of our system design to ensure low latency and high availability. We can use an in-memory database like Redis to store the most recent or most frequently accessed tweets. This will help to reduce the load on our primary database and improve the speed of read operations.

Here are some caching strategies we might use:

  1. Caching User Timelines: We can cache users' timelines, which include the tweets from the people they follow. When a user requests their timeline, we can first check the cache. If the cache has the data, we can return it directly, reducing the need for a database query. If the cache doesn't have the data, we can query the database, update the cache, and then return the data.

  2. Caching Popular Tweets: Tweets that are frequently accessed can be cached. This is particularly useful for popular tweets that are liked, retweeted, or replied to by many users.

  3. Caching User Profiles: We can also cache user profile data, including their tweets, followers, and who they are following. This can make operations like viewing a user's profile or checking who a user is following much faster.

  4. Eviction Policies: Since the cache has limited size, we need to decide how to remove items when the cache is full. We can use policies like Least Recently Used (LRU), where we remove the items that haven't been accessed for the longest time.

  5. Cache Refreshing: We need to decide how to keep the cache up-to-date. One approach is to use a write-through cache, where we update the cache every time we update the database. Another approach is to use a time-to-live (TTL) strategy, where items in the cache are automatically removed after a certain period of time.

  6. Cache Partitioning: To handle large amounts of data and high traffic, we can partition the cache across multiple servers. This can be done based on a hash of the data key, which distributes the data evenly across the servers.

  7. Replication and Persistence: To prevent data loss in case of a cache failure, we can replicate the cache data across multiple servers. We can also periodically persist the cache data to a disk.

  8. Cache Consistency: To ensure data consistency between the cache and the primary database, we can use strategies like read-through, write-through, write-around, or write-back caching.

By implementing these caching strategies, we can significantly improve the performance and user experience of our Twitter-like service.
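The timeline lookup, LRU eviction, and TTL expiry described above can be combined in a small sketch. This is an in-process stand-in for a cache such as Redis, written only to show the control flow; the class and function names, the capacity, and the TTL are all assumptions.

```python
import time
from collections import OrderedDict


class TimelineCache:
    """Tiny LRU cache with per-entry TTL (an in-process stand-in for Redis)."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:   # TTL expired: treat as a miss
            del self._items[key]
            return None
        self._items.move_to_end(key)     # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self.clock() + self.ttl, value)
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def load_timeline(user_id, cache, query_database):
    """Check the cache first; on a miss, query the database and fill the cache."""
    timeline = cache.get(user_id)
    if timeline is None:
        timeline = query_database(user_id)
        cache.put(user_id, timeline)
    return timeline
```

Here `query_database` is any callable that returns a user's timeline. The injectable `clock` makes the expiry behavior easy to exercise without waiting for real time to pass.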

The Caching section provides more detail on caching, including topics beyond loading patterns and eviction policies.

Analytics

The system should include an analytics component that collects and processes data to provide insights into user behavior and improve the recommendation algorithm. The analytics component can help to understand the users' interests, their interaction with the platform, and the performance of the recommendation algorithm.

Here are some key aspects that the analytics component could focus on:

  1. User Behavior Analysis: This involves tracking and analyzing the actions of users on the platform. For example, what tweets they like or retweet, who they follow or unfollow, how often they tweet, etc. This data can help to understand the users' interests and preferences, which can be used to improve the recommendation algorithm.

  2. Content Analysis: This involves analyzing the content of the tweets. For example, what topics are trending, what hashtags are used, what links are shared, etc. This data can help to understand what content is popular and engaging, which can be used to improve the recommendation algorithm.

  3. Performance Analysis: This involves tracking and analyzing the performance of the recommendation algorithm. For example, how often users click on recommended content, how long they spend on recommended content, etc. This data can help to understand how well the recommendation algorithm is working and where it can be improved.

  4. A/B Testing: This involves testing different versions of the recommendation algorithm to see which one performs better. For example, version A might recommend content based on the user's past behavior, while version B might recommend content based on the user's social connections. By comparing the performance of the two versions, we can determine which approach is more effective.

  5. Real-Time Analytics: This involves processing and analyzing data in real-time. For example, tracking trending topics, detecting spam or abusive behavior, etc. Real-time analytics can help to respond quickly to changes and issues on the platform.

To implement these analytics capabilities, we might use a combination of tools and technologies. For example, we might use a data processing framework like Apache Hadoop to process large amounts of data, a data warehouse like Google BigQuery to store and analyze data, and a data visualization tool like Tableau to visualize the results. We might also use machine learning algorithms to predict user behavior and improve the recommendation algorithm.

Follow Up Detailed Design Questions and Answers

  1. How should the system handle the massive write operations for new tweets?

    To handle massive write operations, our approach is to initially write the requests into a message queue. When a user posts a tweet, the request is placed into the queue and subsequently processed in the background. This allows the user to continue interacting with the system without having to wait for the completion of the write operation.

    As for writing data from the message queue into the database, we utilize a distributed database that can scale horizontally. This enables the system to distribute the write load across multiple nodes, thereby reducing the load on any single node.

  2. How should the system generate a unique ID for each tweet?

    We can establish a Distributed ID Generation Service to generate unique IDs. A distributed unique ID generator service, also known as a Key Generation Service (KGS), can be employed to create unique IDs in a distributed system. This is particularly useful when numerous nodes or services need to independently generate unique IDs without overlaps. Examples of such services include Twitter's Snowflake and Instagram's ID generation method.

    [Figure: Using ID Generator Service]

    For a deeper understanding, you can refer to the "Distributed ID Generation Service" section in the URL Shortener Solution.
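    As a rough sketch of a Snowflake-style generator, the code below packs a millisecond timestamp, a worker ID, and a per-millisecond sequence number into one 64-bit integer (41, 10, and 12 bits respectively). The custom epoch and the class name are arbitrary choices for illustration, and this is not Twitter's actual implementation.

```python
import threading
import time


class SnowflakeLikeGenerator:
    """Roughly time-ordered 64-bit IDs: timestamp | worker id | sequence."""

    EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (assumption)
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = int(time.time() * 1000)
            if now < self._last_ms:
                now = self._last_ms  # clock went backwards: stay on the last timestamp
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp_part = (now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQUENCE_BITS)
            worker_part = self.worker_id << self.SEQUENCE_BITS
            return timestamp_part | worker_part | self._sequence


generator = SnowflakeLikeGenerator(worker_id=7)
ids = [generator.next_id() for _ in range(10_000)]
```

    Each worker needs a distinct `worker_id`, which is the only coordination required. IDs from one generator are strictly increasing, and IDs across workers sort approximately by creation time.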

  3. How is the Timeline feature implemented?

    As described in the high-level design section, we can use "fan-out-on-write". This is a data distribution strategy commonly used in systems where there are many more read operations than write operations. The principle behind this strategy is that when a user posts a tweet, the tweet is immediately written to the timelines of all their followers.

    When a user who hasn't logged in for a while requests the timeline, the system first checks the cache. If the cache has the data, it is returned directly. If the cache does not have the data, the system queries the database, updates the cache, and then returns the data.

  4. How to handle the "celebrity problem", where an account with a large number of followers posts a tweet and potentially causes a surge in traffic?

    In the case of a "celebrity" user with a large number of followers, the "fan-out-on-write" operation can be quite expensive, as it involves updating the timelines of millions of followers. This could potentially cause a surge in traffic, leading to performance issues.

    To mitigate this, a hybrid "fan-out-on-write" strategy can be used. In this approach, when a "celebrity" user posts a tweet, the system immediately pushes the tweet to a subset of their followers - for instance, those who are currently online or have been recently active. For the remaining followers, the tweet is added to their timelines when they next request their timeline. This strategy helps balance the load on the system, ensuring that followers can see the tweets in a timely manner while preventing a sudden surge in traffic.

    So, while the "fan-out-on-write" strategy typically writes a tweet to all followers' timelines upon posting, it can be adjusted to the circumstances, such as the author's follower count and the current system load.
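    One way to sketch the hybrid strategy is shown below. The follower threshold, the notion of an "active" user, and every name in the code are assumptions for illustration. Tweets from accounts at or above the threshold are pushed only to active followers and kept in a per-author list; everyone else picks them up when they next load their timeline.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical; a real system would use a far larger value

followers = defaultdict(set)          # author -> follower ids
following = defaultdict(set)          # user -> authors they follow
active_users = set()                  # users considered online or recently active
inboxes = defaultdict(set)            # user -> {(timestamp, author, content)}
celebrity_tweets = defaultdict(list)  # author -> tweets not pushed to every follower


def follow(follower, author):
    followers[author].add(follower)
    following[follower].add(author)


def post_tweet(author, content, timestamp):
    tweet = (timestamp, author, content)
    audience = followers[author]
    if len(audience) >= CELEBRITY_THRESHOLD:
        celebrity_tweets[author].append(tweet)  # kept for pull at read time
        audience = audience & active_users      # push only to active followers
    for follower in audience:
        inboxes[follower].add(tweet)


def timeline(user, size=20):
    # Pull step: merge in celebrity tweets that were not pushed to this user.
    for author in following[user]:
        inboxes[user].update(celebrity_tweets.get(author, []))
    return sorted(inboxes[user], reverse=True)[:size]


for fan in ("a", "b", "c"):
    follow(fan, "star")
active_users.add("a")
post_tweet("star", "hello", timestamp=1)
```

    In this example only "a" receives the tweet at write time; "b" and "c" see it the first time they call `timeline`. Storing inbox entries in a set keeps the push and pull paths from producing duplicates.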

  5. If a timeline is maintained for each user, how does the system handle tweets that were posted before a follow, and tweets from users who are later unfollowed?

    When a user follows another user, the system can add the recent tweets of the followed user to the follower's timeline. This ensures that the follower can immediately see the tweets of the followed user on their timeline after following. The timeline also needs to keep track of the point in time to which tweets from the followed user have been loaded. When the user browses to a point in time on the timeline where tweets have not been loaded, the system needs to fetch the tweets from the followed user and add them to the timeline.

    When a user unfollows another user, the system can remove the unfollowed user's tweets from the unfollower's timeline. This ensures that the unfollower no longer sees the tweets of the unfollowed user on their timeline after unfollowing. However, if the user has interacted with any of the unfollowed user's tweets (for example, by retweeting or liking), those interactions will still remain on the user's timeline.

    In summary, the timeline is a dynamic feature that changes according to the user's current following situation, including loading tweets that were posted before the follow and removing tweets from users who have been unfollowed.

  6. How will the system prevent abuse or overly heavy use by a single user or IP?

    The system can prevent abuse or overly heavy use by implementing rate limiting. This involves limiting the number of requests that a user or IP can make within a certain period of time. If a user or IP exceeds the limit, their requests are temporarily blocked. Rate limiting can help to prevent spam, abuse, and denial-of-service attacks.
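The rate limiting described in the last answer can be sketched with a fixed-window counter, which is one of several possible algorithms. The limit, the window length, and the class name are illustrative assumptions; a deployed limiter would typically keep its counters in a shared store so that all servers see the same counts.

```python
import time
from collections import defaultdict


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per key (user ID or IP) in each time window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._counters = defaultdict(lambda: (None, 0))  # key -> (window index, count)

    def allow(self, key):
        window_index = int(self.clock() // self.window)
        current_window, count = self._counters[key]
        if current_window != window_index:
            # A new window has started: reset the counter for this key.
            current_window, count = window_index, 0
        if count >= self.limit:
            self._counters[key] = (current_window, count)
            return False  # over the limit: block until the next window
        self._counters[key] = (current_window, count + 1)
        return True
```

Calling `allow("203.0.113.5")` or `allow("user:42")` before handling a request returns `False` once that key has used up its quota for the current window. Fixed windows permit short bursts around window boundaries; a sliding window or token bucket smooths this out at the cost of a little more state.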