Difficulty: easy
Description: Design ChatGPT
Introduction
ChatGPT is an AI assistant powered by large language models (LLMs) that can understand and generate human-like text. When a user interacts with ChatGPT, they provide a text prompt and receive a generated response. This process involves several key components and concepts:
Inference vs. Training
In this design problem, we are specifically focusing on the inference infrastructure - the system that allows users to interact with an already trained LLM. We are not designing the training infrastructure, which is a separate, computationally intensive process that creates the model.
In other words, we are designing ChatGPT, the application that allows users to chat with an AI assistant.
Background
How Inference Works
Inference is the process of using a trained model to generate responses:
- User Input: The user sends a text prompt
- Tokenization: The prompt is converted into tokens (smaller text units)
- Model Processing: The LLM processes the tokens and predicts the next tokens
- Response Generation: These predicted tokens are assembled into a coherent response
- Streaming: Responses are typically streamed token-by-token for better user experience
The inference process is computationally expensive, requiring specialized hardware like GPUs or TPUs. This is why most applications don't run LLMs locally but instead connect to remote inference servers.
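Purely as an illustration of the token-by-token loop above, a schematic sketch might look like the following. Every name here (tokenize, model.nextToken, detokenize, END) is a hypothetical placeholder, not a real library API:
// Hypothetical sketch of token-by-token generation; all names are placeholders.
function generate(model, prompt, onToken) {
  const tokens = tokenize(prompt);           // 1. convert the prompt into tokens
  while (true) {
    const next = model.nextToken(tokens);    // 2. the model predicts the next token
    if (next === END) break;                 // stop at the end-of-sequence token
    tokens.push(next);                       // 3. feed the prediction back in
    onToken(detokenize(next));               // 4.-5. stream each decoded token out
  }
}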
System Design Focus
In this problem, we're designing the application infrastructure that:
- Accepts user inputs
- Sends them to inference servers
- Receives and processes model outputs
- Manages conversation history
- Handles data persistence (i.e. storing the chat history)
We'll assume the existence of an inference server that provides the actual model capabilities, and focus on building the system around it.
Traffic Pattern
The traffic pattern of ChatGPT differs significantly from that of typical consumer applications. In contrast to platforms like Instagram—where content written by one user is read by many others—ChatGPT serves a strictly 1:1 interaction model between each user and the AI assistant.
Additionally, in real-world systems, it's common for traffic to be throttled or dropped when inference servers are overwhelmed. (You've probably seen the “Please try again later” message while chatting with overloaded AI services.)
Unique Load Profile:
- 1:1 request-response flow: Each user message triggers one inference and one response.
- Read-write ratio is effectively 1:1: Unlike typical social apps, there's no heavy read amplification.
- No fan-out or content reuse: Responses are personalized and not shared or cached across users.
- Latency-sensitive and compute-bound: System performance is primarily gated by model inference time.
This means we don’t need to use our master system design template, which is optimized for read-heavy workloads and eventual consistency.
Instead, our focus should be on:
- Handling inference backpressure
- Ensuring data durability post-streaming
- Implementing tiered quality of service for premium users
Functional Requirements
Core Requirements
- Login and Retrieve Chat History: Users should be able to retrieve and view their chat history upon login.
- Send Chat and Store Response: Users send new prompts and receive real-time AI-generated responses.
Out of Scope
- Multi-device sync
- Multilingual support
- Fine-grained access control
- Plugin integrations
Scale Requirements
- Support 10M daily active users (DAU)
- Each user sends 20 prompts per day on average
- System must support inference on each user prompt using an LLM
- Inference results must be streamed to the user interface
- User messages and AI responses must be stored permanently
- Inference latency target < 2 seconds under normal load
- Read:write ratio of 1:1
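As a quick back-of-envelope check on these numbers: 10M DAU × 20 prompts per day is 200M inferences per day, or roughly 200,000,000 / 86,400 ≈ 2,300 requests per second on average, with peaks likely several times higher. Because the read:write ratio is 1:1, chat-history writes arrive at the same order of magnitude.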
Non-Functional Requirements
- High availability: chatbot should be usable 24/7
- High scalability: inference server should handle 10M DAU
- Low latency: response should be streamed as soon as generation starts
API Endpoints
GET /chat/history
Retrieve past chat history for the current user session.
Response Body:
{ chats: [ { id: string, user_message: string, assistant_response: string, timestamp: string } ] }
POST /chat
Submit a new chat prompt and receive a streamed or full response.
Response Body:
Non-streaming (stream=false):
{
message_id: string,
response: string
}
Streaming (stream=true):
data: {"content": "Hello"}
data: {"content": " world"}
data: [DONE]
POST /chat/store
Persist a completed conversation turn (user + assistant) into the database.
Response Body:
{ status: "success" }
High Level Design
1. Login and Retrieve Chat History
Users should be able to retrieve and view their chat history upon login.
When the user logs in, the Client calls the App Server to fetch chat history. The App Server queries the Database using user_id, retrieves chat logs, and sends them back to the client to render.
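A rough sketch of this flow in an Express-style App Server (the db module and its getChatsByUser helper are hypothetical placeholders, and req.userId is assumed to be set by auth middleware):
const express = require('express');
const db = require('./db');   // hypothetical data-access module
const app = express();

// GET /chat/history: return the caller's past chats.
app.get('/chat/history', async (req, res) => {
  const userId = req.userId;                       // assumption: populated by auth middleware
  const chats = await db.getChatsByUser(userId);   // e.g., SELECT ... WHERE user_id = ? ORDER BY timestamp
  res.json({ chats });
});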
2. Send Chat and Store Response
Users send new prompts and receive real-time AI-generated responses.
The Client sends a new chat message to the App Server, which forwards the prompt to the Inference Server. The LLM's response is returned to the App Server.
Afterward, the App Server writes the full chat exchange (user + assistant message) to the Database. Once the write is confirmed, the App Server delivers the response to the Client.
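A minimal non-streaming sketch of this flow (the inference server URL, its request/response shape, and the db.saveChat helper are all assumptions for illustration):
const express = require('express');
const db = require('./db');   // hypothetical data-access module
const app = express();
app.use(express.json());

// POST /chat: forward the prompt, persist the turn, then return the answer.
app.post('/chat', async (req, res) => {
  const { prompt } = req.body;

  // 1. Forward the prompt to the inference server (URL and payload are assumptions).
  const infRes = await fetch('http://inference-server/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  const { response } = await infRes.json();

  // 2. Persist the full exchange; db.saveChat is a hypothetical helper.
  const messageId = await db.saveChat(req.userId, prompt, response);

  // 3. Deliver the response to the client once the write is confirmed.
  res.json({ message_id: messageId, response });
});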
Deep Dive Questions
How to implement streaming? (Server and Frontend UX)
Streaming in ChatGPT is implemented using incremental token generation on the server and real-time rendering on the frontend. This improves perceived latency and mimics human-like typing behavior.
Server-Side Implementation:
- The LLM generates one token at a time.
- After each token, the inference server emits it immediately to the client using Server-Sent Events (SSE) or HTTP chunked responses.
- Streaming is triggered via a stream=true flag on the API request.
Example token stream:
data: {"choices":[{"delta":{"content":"The"}}]}
data: {"choices":[{"delta":{"content":" cat"}}]}
data: [DONE]
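On the App Server, the SSE relay can be sketched as follows (GET-style to match the EventSource example below; streamTokensFrom is a hypothetical async iterator over tokens coming from the inference server):
const express = require('express');
const app = express();

// GET /chat?stream=true: relay tokens to the browser as Server-Sent Events.
app.get('/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');   // SSE content type
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  for await (const token of streamTokensFrom(req.query.prompt)) {
    // One SSE event per token, in the same delta format shown above.
    res.write(`data: ${JSON.stringify({ choices: [{ delta: { content: token } }] })}\n\n`);
  }
  res.write('data: [DONE]\n\n');   // signal the client that generation has finished
  res.end();
});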
Frontend Implementation:
- The client listens for incoming token events via EventSource.
- Each token is appended to the chat box as it's received.
- A final [DONE] signal tells the UI to stop rendering.
Example code:
// Note: EventSource only issues GET requests, so the stream flag is passed as a query parameter.
const source = new EventSource('/chat?stream=true');
source.onmessage = (e) => {
  if (e.data === '[DONE]') { source.close(); return; }  // final signal: stop rendering
  const token = JSON.parse(e.data).choices[0].delta.content;
  appendToTextbox(token);  // append each token to the chat box as it arrives
};
How do we handle database writes during or after streaming?
Since streaming delivers responses incrementally, we must be careful about when to persist chat history.
- Persist only after full message generation completes.
- Once the [DONE] token is sent, the App Server writes the complete user message and assistant response to the database.
Why Not Write Midway?
- Partial writes can corrupt chat history or create broken UX (e.g., cut-off answers).
- Also, the LLM may backtrack during sampling (i.e., revise tokens it has already generated), so intermediate tokens may not match the final result. Partial writes also multiply the failure scenarios we have to handle, and there is little to gain since the user can always retry on their own.
Optional Resilience to improve durability:
- Use a temporary buffer or in-memory queue for in-progress messages.
- Mark DB entries with a status of in_progress, updated to completed post-generation.
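Putting this together, the write path can be sketched as follows (db.insertChat and db.markCompleted are hypothetical helpers; the status values mirror the in_progress/completed idea above):
// Buffer tokens during streaming; persist the final response only after [DONE].
async function persistTurn(userId, userMessage, tokenStream) {
  // Optional resilience: record the turn up front as in_progress.
  const id = await db.insertChat({ userId, userMessage, assistantResponse: '', status: 'in_progress' });

  let response = '';
  for await (const token of tokenStream) {
    response += token;   // accumulate in memory while tokens stream to the client
  }

  // Only once generation is complete do we write the final assistant response.
  await db.markCompleted(id, response);   // sets assistant_response and status = 'completed'
  return id;
}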
What happens when inference servers are overwhelmed?
When inference traffic spikes, it's impossible to serve all requests. There are two possible user experiences:
- Everyone receives slower responses (high latency, timeouts).
- Some users are denied service temporarily and told to retry later, while others are served at the normal rate.
In practice, Option 2 is almost universally adopted, because Option 1 degrades the experience for all users and often leads to server collapse.
Therefore, the frontend is designed to display a clear message like:
“Our AI servers are busy. Please try again shortly.”
This ensures fast feedback and retains user trust.
How should the system protect the inference servers under high traffic?
A reverse proxy or load balancer (e.g., Nginx, Envoy) is deployed in front of inference servers.
It performs two key roles:
- Maintains heartbeat polling with each inference server to track memory load or queue depth.
- Routes only to servers under safe load thresholds.
When no servers are available:
- The proxy returns a 503 (Service Unavailable) response.
- The App Server relays this to the frontend.
- The frontend retries using exponential backoff: wait 10s, 20s, 40s, etc.
This protects the cluster from cascading failure and ensures predictable behavior under stress.
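In pseudocode, the proxy's routing decision reduces to something like this sketch (the load metric, threshold value, and server list shape are assumptions):
// Each inference server periodically reports liveness and queue depth via heartbeats.
const SAFE_QUEUE_DEPTH = 32;   // assumed threshold; tune per deployment

function pickServer(servers) {
  const healthy = servers.filter(s => s.alive && s.queueDepth < SAFE_QUEUE_DEPTH);
  if (healthy.length === 0) {
    return null;               // the caller turns this into a 503 response
  }
  // Route to the least-loaded healthy server.
  return healthy.reduce((a, b) => (a.queueDepth <= b.queueDepth ? a : b));
}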
How do we implement tiered quality of service, so that premium users get priority when the system is under heavy load?
The goal is twofold:
- Premium users are always served.
- Overall inference utilization is maximized.
Static Reservation
One simple idea is to statically reserve a portion of inference servers for premium traffic. However:
- During overload, free users are denied even if premium resources are idle. This is wasteful and a poor experience for free-tier users.
- It's difficult to size the reservation correctly because premium traffic fluctuates, both with the number of premium users and with their usage patterns (e.g., spikes around major events). This makes estimation non-trivial and translates into more waste in reserved resources.
Load-Based Filtering
A better approach is to treat the entire inference fleet as a shared pool and implement load-based filtering:
- When system load exceeds a critical threshold (e.g., GPU memory pressure or queue depth):
  - Free-tier traffic is rejected immediately.
  - Premium traffic is still admitted, as long as the server isn't fully saturated.
- Once load returns below the threshold:
  - Free-tier traffic is re-admitted automatically.
  - No servers sit idle during this time — all resources are dynamically shared.
This approach achieves the same effect as reservation, but without idle compute. This system works because:
- Most load comes from free-tier traffic, so denying just that class results in a rapid load drop.
- Since inference requests are short (e.g., 1–2 seconds), even a full stop on free-tier requests clears queues quickly.
- Premium traffic, being smaller in volume, can safely continue without tipping the system into overload.
- No need to forecast premium traffic or maintain idle capacity — the system self-balances.
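Conceptually, the admission check described above reduces to a small function like this (the load metric and threshold values are illustrative assumptions):
// Illustrative load-based admission filter; clusterLoad is a 0–1 utilization figure.
const FREE_TIER_CUTOFF = 0.8;   // above this load, reject free-tier traffic
const HARD_CUTOFF = 0.97;       // above this, even premium traffic is rejected

function admit(request, clusterLoad) {
  if (clusterLoad >= HARD_CUTOFF) return false;      // fully saturated: shed everything
  if (clusterLoad >= FREE_TIER_CUTOFF) {
    return request.tier === 'premium';               // only premium traffic is admitted
  }
  return true;                                       // normal load: admit everyone
}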
Retry strategies
- Premium requests use linear backoff (e.g., retry every 10s).
- Free-tier requests use exponential backoff (e.g., 10s → 20s → 40s…).
This ensures premium users retry faster and reclaim capacity early, while free users back off to reduce contention.
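On the client, this tiered backoff boils down to a small delay schedule (the 10-second base mirrors the examples above):
// Retry delay in seconds for the nth retry attempt (attempt = 1, 2, 3, ...).
function retryDelaySeconds(tier, attempt) {
  const base = 10;
  return tier === 'premium'
    ? base                           // premium: retry every 10s
    : base * 2 ** (attempt - 1);     // free tier: 10s, 20s, 40s, ...
}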
Let's look at two edge cases with overload:
Case 1: No Premium Traffic
- During overload, free-tier traffic is denied.
- Server load drops quickly.
- After the "cool-down" period (e.g., 2 minutes), free-tier traffic resumes normally.
- Resources are never idle for long.
Case 2: Heavy Premium Load
- Free users are blocked entirely.
- Servers focus solely on premium users.
- Load drops due to free-tier denial, but premium traffic immediately fills the gap.
- Free traffic remains throttled until premium demand is satisfied.
This design achieves both goals:
- Premium users are almost always served unless all capacity is completely exhausted.
- Inference resources are never wasted, because unused premium capacity is made available to free users automatically.