Difficulty: easy
Description: Design a coding contest platform like LeetCode.
Introduction
There is a question bank that stores thousands of programming questions. For each question, users can write, compile and submit code against test cases and get results. User submitted code needs to be persisted. Every week, there is a contest where people compete to solve questions as quickly as possible. Users will be ranked based on their accuracy and speed. We’ll need a leaderboard that shows rankings in real time.
Functional Requirements
Core Requirements
- View Problems: Users should be able to browse a list of problems and view each question's description, input/output examples, and constraints.
- Submit Solution: Users should be able to solve coding questions by submitting their solution code and running it against built-in test cases to get results.
- Coding Contest: Users can participate in coding contests. A contest is a timed event with a fixed duration of 2 hours, consisting of four questions. The score is calculated based on the number of questions solved and the time taken to solve them. Results are displayed in real time, and the leaderboard shows the top 50 users with their usernames and scores.
Out of Scope
- Authentication
- Authorization
- User management
- Contest history
Scale Requirements
- Supporting 10k users participating in contests concurrently
- There is at most one contest a day
- Each contest lasts 2 hours
- Each user submits solutions 20 times on average
- Each submission executes 20 test cases on average
- User-submitted code has a retention policy of one month, after which it is deleted.
- Assume the storage required for each stored solution is about 10KB.
- Assume a read:write ratio of 2:1.
Non-Functional Requirements
- high availability: the website should be accessible 24/7
- high scalability: the website should be able to handle 10k users submitting code concurrently
- low latency: during a contest, the leaderboard should be updated in real time
- security: user-submitted code must not be able to compromise the server
Resource Estimation
QPS
Assume each user participates in one contest a day. Given 10k users participating in a two-hour contest concurrently, each submitting solutions 20 times on average, the total number of writes for a contest is: 10k * 20 = 200k writes. Assuming a read:write ratio of 2:1, the total number of reads for a contest is: 200k * 2 = 400k reads. The writes come to 200k / (2 * 3600s) ≈ 28 QPS (and reads to roughly 56 QPS), a manageable number.
Storage
Assuming about 100 lines of code per problem, we roughly estimate each submission at about 10KB for verbose languages, including the driver code and test outputs. With 10k users each submitting 20 times per day, after one month the total storage for user solutions is: 10k * 20 * 10KB * 30 ≈ 60GB. This is a very manageable number.
Check out how QPS influences system design for more details on what we would need to handle the load.
API Endpoints
GET /problems?start={start}&end={end}
Get problems in a range. If start and end are not provided, return the first page of problems with a default page size.
Response Body:
{
  problems: [
    { id: string, title: string, difficulty: string }
  ]
}
GET /problems/{problem_id}
Get problem details, including description, constraints, examples, and starter code.
Response Body:
{
  id: string,
  title: string,
  difficulty: string,
  description: string,
  constraints: string,
  examples: [{ input: string, output: string }],
  starter_code: { language: string, code: string }
}
POST /problems/{problem_id}/submission
Submit code for a problem and get results.
Request Body:
{
  user_id: string,
  code: string,
  language: string
}
Response Body:
{
  status: success | fail,
  test_cases: [
    { status: success | fail | timeout }
  ]
}
GET /contests/{contest_id}/leaderboard
Get leaderboard for a contest.
Response Body:
{
  ranking: [
    { user_id: string, score: number }
  ]
}
High Level Design
1. View Problems
Users should be able to browse a list of problems and view each question's description, input/output examples, and constraints.
The design for viewing problems uses a standard three-tier architecture:
- Client: Frontend application sending HTTP GET requests
- Problems Service: Backend service with endpoints:
  - GET /problems?start={start}&end={end}: List problems
  - GET /problems/{id}: Get specific problem details
- Database: Stores problem data (id, title, description, difficulty, examples, constraints)
This simple setup efficiently handles problem viewing requests while maintaining separation of concerns.
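As a minimal sketch of this service (assuming Python with FastAPI; the in-memory PROBLEMS dict is a stand-in for the database, and the field names follow the API section above):

```python
# Minimal sketch of the Problems Service, assuming FastAPI.
# PROBLEMS is an in-memory stand-in for the problems database.
from fastapi import FastAPI, HTTPException

app = FastAPI()

PROBLEMS = {
    "1": {"id": "1", "title": "Two Sum", "difficulty": "easy",
          "description": "...", "constraints": "...", "examples": []},
}

@app.get("/problems")
def list_problems(start: int = 0, end: int = 20):
    # Return only the summary fields needed for the list view
    page = list(PROBLEMS.values())[start:end]
    return {"problems": [
        {"id": p["id"], "title": p["title"], "difficulty": p["difficulty"]}
        for p in page
    ]}

@app.get("/problems/{problem_id}")
def get_problem(problem_id: str):
    # Full details: description, constraints, examples, etc.
    if problem_id not in PROBLEMS:
        raise HTTPException(status_code=404, detail="problem not found")
    return PROBLEMS[problem_id]
```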
2. Submit Solution
Users should be able to solve coding questions by submitting their solution code and running it against built-in test cases to get results.
To design the code evaluation flow for our LeetCode-like platform, we need to handle the high-concurrency requirement: 10,000 users participating in contests simultaneously, each submitting solutions about 20 times on average. This is a classic queue-and-worker use case.
We use a Code Evaluation Service as the entry point for code submissions. This service will be responsible for receiving code from users and initiating the evaluation process. However, to handle the high volume of submissions efficiently and securely, we introduce a Message Queue as a buffer between the submission process and the actual code execution.
Note that most system design questions call for microservices. For a simpler problem like this, a single service could handle both problem viewing and code evaluation. However, code execution is compute-heavy, and we do not want a flood of submissions to bring down the entire service. Therefore, we opt to separate the two services.
When a user submits a solution, the Code Evaluation Service pushes a message containing the submission details (user ID, problem ID, and code) to the Message Queue. This decouples the submission process from the execution process, allowing us to handle traffic spikes more effectively.
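A minimal sketch of this enqueue step, assuming Redis (via redis-py) as the message queue; the queue name and payload fields are illustrative:

```python
# Sketch: Code Evaluation Service pushing a submission onto the queue.
# Assumes Redis as the message queue; "submission_queue" is illustrative.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue_submission(user_id: str, problem_id: str,
                       code: str, language: str) -> None:
    message = json.dumps({
        "user_id": user_id,
        "problem_id": problem_id,
        "code": code,
        "language": language,
    })
    # LPUSH buffers the submission until a worker picks it up
    r.lpush("submission_queue", message)
```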
On the other side of the Message Queue, we have multiple Code Execution Workers. These workers continuously poll the queue for new submissions. When they receive a message, they fetch the necessary test cases from the Problems database, execute the code in a sandboxed environment, and record the results. Let's compare different execution options for the sandbox environment:
Code Execution Options
API Server
Direct execution on the API server
How it works:
The user-submitted code is directly executed on the same server that hosts the API. The code is typically saved as a file on the server's filesystem and then run using the appropriate interpreter or compiler.
Pros:
- Simple to implement
- Low latency as there's no additional layer between the API and code execution
Cons:
- Severe security risks: Malicious code can potentially access or modify server data
- Poor isolation: A bug in user code could crash the entire server
- Resource management issues: One user's code could consume all server resources
Conclusion:
Not suitable for running user-submitted code. The security and stability risks far outweigh the simplicity benefits.
We will choose containers (Docker) over serverless functions for faster startup times and less vendor lock-in.
To ensure security and isolation, each submission runs in a separate container, preventing malicious code from affecting our system or other users' submissions. We can use technologies like Docker for this purpose.
After execution, the results (including whether the code passed all test cases, execution time, and memory usage) are saved in the Submissions table.
This design allows us to scale horizontally by adding more Code Execution Workers as needed to handle increased load during peak times or popular contests. The use of a Message Queue also provides a level of fault tolerance - if a Code Execution Worker fails, the message remains in the queue and can be picked up by another worker.
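Sketching the worker side under the same assumptions (Redis as the queue; fetch_test_cases, run_in_container, and save_results are placeholders for the steps described above):

```python
# Sketch: Code Execution Worker polling the queue for submissions.
# fetch_test_cases, run_in_container, and save_results are placeholders
# for the steps described above (DB fetch, sandboxed run, persistence).
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_test_cases(problem_id: str) -> list:
    ...  # load the problem's test case files from the Problems database

def run_in_container(code: str, language: str, test_cases: list) -> dict:
    ...  # execute in a Docker sandbox; see the security deep dive below

def save_results(submission: dict, results: dict) -> None:
    ...  # write the outcome to the Submissions table

def worker_loop() -> None:
    while True:
        # BRPOP blocks until a submission message is available
        _, raw = r.brpop("submission_queue")
        submission = json.loads(raw)
        test_cases = fetch_test_cases(submission["problem_id"])
        results = run_in_container(submission["code"],
                                   submission["language"], test_cases)
        save_results(submission, results)
```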
3. Coding Contest
Users can participate in coding contests. A contest is a timed event with a fixed duration of 2 hours, consisting of four questions. The score is calculated based on the number of questions solved and the time taken to solve them. Results are displayed in real time, and the leaderboard shows the top 50 users with their usernames and scores.
A contest is basically four problems within a time window, so running the contest itself is straightforward. We can use a centralized Contest Service to manage it. After a user submits a solution, it's graded by the Code Evaluation Service as described in the previous section. The results are then sent to the Contest Service, which records the user's score in the database.
The key decision here is where to store the contest data, users' scores, and the leaderboard. We need real-time updates for the leaderboard, and the data changes frequently. We can use an in-memory database like Redis to store the leaderboard: Redis is a fast, in-memory store well suited to high-performance, real-time data processing. Check the deep dive below for more details on how the leaderboard is implemented.
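As a sketch of the scoring path (assuming redis-py; the key name and scoring scheme are illustrative, and the deep dive below covers the sorted-set mechanics):

```python
# Sketch: Contest Service recording a score after a graded submission.
# The key name and scoring scheme are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

def record_score(contest_id: str, user_id: str, points: float) -> None:
    # ZINCRBY adds the points to the user's score in the sorted set,
    # which keeps the leaderboard ordered automatically.
    r.zincrby(f"leaderboard:{contest_id}", points, user_id)
```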
Deep Dive Questions
How does the Code Evaluation Service decide if a submission is correct?
The Code Evaluation Service decides if a submission is correct by comparing the output of the code with the expected output.
Here's an example of how this works:
Requirement:
- Problem: Add two numbers
- Input: 2, 3
- Expected Output: 5
- Time Limit: 1 second
User submitted code:
def add(a, b):
    return a + b
Test case files:
test_case_1.in:
2, 3
test_case_1_expected_output.out:
5
test_case_1.out (produced by running the code):
5
The Code Evaluation Service runs the code with the input to produce the output file test_case_1.out. It then compares test_case_1.out with test_case_1_expected_output.out. If they are the same, the submission is marked as correct. Otherwise, it's marked as incorrect.
If the code fails to execute, the submission is marked as failed. If the code execution time exceeds the time limit, the submission is marked as timeout.
The test case files, such as test_case_1.in and test_case_1_expected_output.out, are stored in the database.
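A minimal sketch of this flow (assuming a Python submission saved as solution.py that reads from stdin; the status values match the submission API above):

```python
# Sketch: run a submission against one test case and judge the result.
# Assumes a Python submission saved as solution.py that reads stdin;
# status values match the submission API above.
import subprocess

def judge(test_input: str, expected_output: str,
          time_limit_s: float = 1.0) -> str:
    try:
        completed = subprocess.run(
            ["python", "solution.py"],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=time_limit_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"  # execution exceeded the time limit
    if completed.returncode != 0:
        return "fail"     # the code failed to execute
    # Compare actual output with the expected output file's contents
    if completed.stdout.strip() == expected_output.strip():
        return "success"
    return "fail"
```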
How to ensure security and isolation in code execution?
Running the code in a sandboxed environment like a container already provides a measure of security and isolation.
We can further enhance security by limiting the resources that each code execution can use, such as CPU, memory, and disk I/O.
We can also use technologies like seccomp to further limit the system calls that the code can make.
To be specific, to prevent malicious code from affecting other users' submissions, we can use the following techniques:
- Resource Limitation: Limit the resources that each code execution can use, such as CPU, memory, and disk I/O.
- Time Limitation: Limit the execution time of each code execution.
- System Call Limitation: Use technologies like seccomp to limit the system calls that the code can make.
- User Isolation: Run each user's code in a separate container.
- Network Isolation: Limit the network access of each code execution.
- File System Isolation: Limit the file system access of each code execution.
Here's a sample set of flags for docker run ...:
docker run --cpus="0.5" --memory="512M" --cap-drop="ALL" --security-opt="seccomp=profile.json" --network="none" --read-only ...
This limits the code execution to at most 0.5 CPU and 512MB of memory, drops all Linux capabilities, applies a custom seccomp profile (note that seccomp=unconfined would disable seccomp filtering entirely, the opposite of what we want), disables networking, and mounts the container's filesystem read-only.
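From a worker, such a container might be launched like this (a sketch; the image name, mount path, and seccomp profile path are illustrative):

```python
# Sketch: launch a sandboxed container for one submission.
# Image name, mount path, and seccomp profile path are illustrative.
import subprocess

def run_sandboxed(workdir: str, time_limit_s: float = 10.0):
    cmd = [
        "docker", "run", "--rm",
        "--cpus=0.5", "--memory=512M",
        "--cap-drop=ALL",
        "--security-opt=seccomp=profile.json",
        "--network=none", "--read-only",
        "-v", f"{workdir}:/sandbox:ro",  # submission files, read-only
        "python:3.12-slim",
        "python", "/sandbox/solution.py",
    ]
    # The subprocess timeout backstops the in-container time limit
    return subprocess.run(cmd, capture_output=True, text=True,
                          timeout=time_limit_s)
```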
How to implement a leaderboard that supports top N queries in real time?
Let's take a look at our options:
Leaderboard Implementation Options
Redis Sorted Set
In-memory data structure store, using ZSET for leaderboard
How it works:
Uses a sorted set data structure to maintain scores. Each user is a member of the set, with their score as the sorting criteria.
Pros:
- Extremely fast read and write operations
- Built-in ranking and range queries
- Scalable and can handle high concurrency
Cons:
- Data is volatile (in-memory), requiring persistence strategies
- May require additional infrastructure setup
Conclusion:
Highly suitable for real-time leaderboards with frequent updates and queries.
We want to use an in-memory database like Redis because the leaderboard is updated very frequently and older data can be discarded. Redis actually comes with a built-in leaderboard implementation: the sorted set (ZSET). Sorted sets combine unique elements (members) with a score and keep the elements ordered by score automatically.
Sorted Set Internals
Internally, the sorted set is implemented as a hash map plus a skip list. The hash map stores the mapping between user IDs and scores, while the skip list keeps the scores in ascending order. The skip list is not a common data structure, so it's worth explaining here.
- Skip list: A skip list is a probabilistic data structure that allows for fast search, insert, and delete operations. It's a series of linked lists with additional pointers to intermediate nodes, allowing for efficient traversal and search. Functionally, it's similar to a balanced binary search tree, but it's implemented using linked lists and randomization. The time complexity of search, insert, and delete is O(log n) on average. Randomization keeps the skip list balanced, much as binary search trees are kept balanced by rotating nodes. Finally, a skip list can return the top N elements in O(log M + N) time, where N is the number of elements to return and M is the number of elements in the list.
Here's an example of a skip list:
Level 3: [ 1 ]-------------------------------------> [ 20 ]
| |
Level 2: [ 1 ]-----------------> [ 7 ]-------------> [ 20 ]
| | |
Level 1: [ 1 ]-----> [ 3 ]-----> [ 7 ]------> [ 15 ]-[ 20 ]
| | | | |
Level 0: [ 1 ] [ 2 ] [ 3 ] [ 5 ] [ 7 ] [ 10 ] [ 15 ] [ 20 ] [ 25 ]
- Levels: Higher levels have fewer elements, allowing "skips" to speed up searching. For example, level 3 can directly skip from node 1 to 20, avoiding unnecessary steps.
- Nodes: Each node holds a value (e.g., 1, 3, 7, etc.). At level 0, every node is linked like a standard linked list.
- Links: Vertical links connect nodes across different levels, while horizontal links connect nodes at the same level.
The search works by starting at the highest level and moving down as needed:
- Start at the highest level (e.g., Level 3).
- Move right until the next node’s value is greater than or equal to the target.
- Move down a level if the next value is too large.
- Repeat this process at each level, moving right as far as possible, then down, until the target is found or confirmed absent.
Example: Searching for 15
- Start at node 1 on Level 3, drop down since 20 > 15.
- On Level 2, move from 1 to 7, then drop down again.
- On Level 1, move to 15 and stop.
This approach gives an average O(log n) time complexity by efficiently skipping unnecessary nodes.
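To make the search concrete, here's a toy sketch of skip-list search over a hand-built structure matching the diagram above (not Redis's actual implementation):

```python
# Toy sketch of skip-list search, mirroring the diagram above.
# Not Redis's actual implementation.
class Node:
    def __init__(self, value):
        self.value = value
        self.forward = {}  # level -> next node at that level

def search(head, target, max_level):
    node = head
    for level in range(max_level, -1, -1):
        # Move right while the next value is still below the target,
        # then drop down a level and repeat.
        while level in node.forward and node.forward[level].value < target:
            node = node.forward[level]
    node = node.forward.get(0)
    return node is not None and node.value == target

def build_example():
    # Levels from the diagram: 1 and 20 reach level 3, 7 reaches
    # level 2, 3 and 15 reach level 1, the rest stay at level 0.
    levels = {1: 3, 3: 1, 7: 2, 15: 1, 20: 3}
    head = Node(float("-inf"))  # sentinel head node
    last = {lvl: head for lvl in range(4)}
    for value in [1, 2, 3, 5, 7, 10, 15, 20, 25]:
        node = Node(value)
        for lvl in range(levels.get(value, 0) + 1):
            last[lvl].forward[lvl] = node
            last[lvl] = node
    return head

head = build_example()
assert search(head, 15, max_level=3)      # found, as in the walkthrough
assert not search(head, 4, max_level=3)   # absent
```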
To get the leaderboard, i.e. the top N users, we can go down to level 0, find the last element in the list, and then iterate backward to collect the top N users (the linked lists in a skip list are doubly linked). In Redis, this is implemented as ZSET. We use ZADD to add a user with a score, and ZREVRANGE (or ZRANGE with the REV option) to get the top N users.
Leaderboard Example
Here's a simple example:
ZADD leaderboard 100 user1
ZADD leaderboard 200 user2
ZADD leaderboard 300 user3
ZREVRANGE leaderboard 0 1 # returns the top 2 users
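The same operations from application code, assuming redis-py:

```python
# Sketch: the same leaderboard operations via redis-py.
import redis

r = redis.Redis(host="localhost", port=6379)

r.zadd("leaderboard", {"user1": 100, "user2": 200, "user3": 300})
# Top 50 users with scores, highest first (the leaderboard view)
top = r.zrevrange("leaderboard", 0, 49, withscores=True)
# e.g. [(b'user3', 300.0), (b'user2', 200.0), (b'user1', 100.0)]
```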
How to handle a large number of contestants submitting solutions at the same time?
Handling a large number of contestants submitting solutions simultaneously presents several challenges, particularly during peak times like the beginning and end of a contest. We can address this using a two-phase processing approach combined with other strategies:
- Message Queue: Use a message queue to decouple the submission process from the execution process. This allows us to buffer submissions during peak times and process them as resources become available.
- Pre-scaling: Since auto-scaling may not be fast enough to handle sudden spikes, pre-scale the execution servers before the contest starts. This is especially important for the first and last 10 minutes of a contest, when submission rates are typically highest.
- Two-Phase Processing (see the sketch at the end of this section):
  - Phase 1 (During Contest): Implement partial evaluation
    - Run submissions against only about 10% of the test cases.
    - Provide a partial standing for users, giving them immediate feedback.
  - Phase 2 (Post-Contest): Complete full evaluation
    - After the contest deadline, run the submissions on the remaining test cases.
    - Determine the official results based on full evaluation.
This two-phase approach:
- Reduces the load on the system during the contest.
- Allows for more accurate final results.
- Is similar to the approach used by platforms like Codeforces for large contests.
- Efficient Resource Allocation: Optimize resource allocation by scaling down during less active periods of the contest and scaling up for the final rush.
- Prioritization: Implement a prioritization system in the message queue to ensure that at least one submission from each user is processed quickly, providing everyone with some feedback.
- User Communication: Clearly communicate to users that their submissions are being partially evaluated during the contest and that final results will be available after full evaluation post-contest.
This approach, centered around the two-phase processing technique, balances immediate feedback for users with efficient resource utilization. While it means that final results won't be available immediately after the contest ends, it allows for handling a much larger number of concurrent submissions with existing resources. The trade-off between real-time complete results and system scalability is often necessary for large-scale coding competitions.
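A minimal sketch of the phase split referenced in the list above (the 10% figure follows the two-phase approach; taking the first slice of the list is an illustrative choice):

```python
# Sketch: split test cases between the two evaluation phases.
# The 10% ratio follows the two-phase approach above; taking the
# first slice of the list is an illustrative choice.
def split_test_cases(test_cases: list, partial_ratio: float = 0.1):
    cutoff = max(1, int(len(test_cases) * partial_ratio))
    phase1 = test_cases[:cutoff]   # run during the contest
    phase2 = test_cases[cutoff:]   # run after the contest deadline
    return phase1, phase2
```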