Difficulty: easy
Description: Design ChatGPT
Introduction
ChatGPT is an AI assistant powered by large language models (LLMs) that can understand and generate human-like text. When a user interacts with ChatGPT, they provide a text prompt and receive a generated response. This process involves several key components and concepts:
Inference vs. Training
In this design problem, we are specifically focusing on the inference infrastructure - the system that allows users to interact with an already trained LLM. We are not designing the training infrastructure, which is a separate, computationally intensive process that creates the model.
In other words, we are designing ChatGPT, the application that allows users to chat with an AI assistant.
Background
How Inference Works
Inference is the process of using a trained model to generate responses:
- User Input: The user sends a text prompt
- Tokenization: The prompt is converted into tokens (smaller text units)
- Model Processing: The LLM processes the tokens and predicts the next tokens
- Response Generation: These predicted tokens are assembled into a coherent response
- Streaming: Responses are typically streamed token-by-token for better user experience
The inference process is computationally expensive, requiring specialized hardware like GPUs or TPUs. This is why most applications don't run LLMs locally but instead connect to remote inference servers.
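To make these steps concrete, here is a purely illustrative sketch of the generation loop; tokenize, predictNextToken, detokenize, and END_OF_SEQUENCE are hypothetical placeholders for what a real inference server does internally:
// Illustrative only: a simplified view of token-by-token generation.
// tokenize, predictNextToken, detokenize, and END_OF_SEQUENCE are hypothetical.
async function* generate(prompt, maxTokens = 256) {
  const tokens = tokenize(prompt);                 // Tokenization
  for (let i = 0; i < maxTokens; i++) {
    const next = await predictNextToken(tokens);   // Model Processing
    if (next === END_OF_SEQUENCE) break;
    tokens.push(next);                             // Response Generation
    yield detokenize([next]);                      // Streaming: emit token-by-token
  }
}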
System Design Focus
In this problem, we're designing the application infrastructure that:
- Accepts user inputs
- Sends them to inference servers
- Receives and processes model outputs
- Manages conversation history
- Handles data persistence (i.e. storing the chat history)
We'll assume the existence of an inference server that provides the actual model capabilities, and focus on building the system around it.
Traffic Pattern
The traffic pattern of ChatGPT differs significantly from that of typical consumer applications. In contrast to platforms like Instagram—where content written by one user is read by many others—ChatGPT serves a strictly 1:1 interaction model between each user and the AI assistant.
Additionally, in real-world systems, it's common for traffic to be throttled or dropped when inference servers are overwhelmed. (You've probably seen the “Please try again later” message while chatting with overloaded AI services.)
Unique Load Profile:
- 1:1 request-response flow: Each user message triggers one inference and one response.
- Read-write ratio is effectively 1:1: Unlike typical social apps, there's no heavy read amplification.
- No fan-out or content reuse: Responses are personalized and not shared or cached across users.
- Latency-sensitive and compute-bound: System performance is primarily gated by model inference time.
This means we don't need to use our master system design template, which is optimized for read-heavy workloads and eventual consistency.
Instead, our focus should be on:
- Handling inference backpressure,
- Ensuring data durability post-streaming,
- And implementing tiered quality of service for premium users.
Functional Requirements
Core Requirements
- Login and Retrieve Chat History: Users should be able to retrieve and view their chat history upon login.
- Send Chat and Store Response: Users send new prompts and receive real-time AI-generated responses.
Out of Scope
- Multi-device sync
- Multilingual support
- Fine-grained access control
- Plugin integrations
Scale Requirements
- Supporting 10M daily active users (DAU)
- Each user sends 20 prompts per day on average
- System must support inference on each user prompt using an LLM
- Inference results must be streamed to user interface
- User messages and AI responses must be stored permanently
- Inference latency target < 2 seconds under normal load
- Read:write ratio of 1:1
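A rough back-of-the-envelope estimate based on these numbers (average load only; peak traffic will be higher):
// Back-of-the-envelope load estimate from the scale requirements above.
const dau = 10_000_000;                       // 10M daily active users
const promptsPerUser = 20;                    // prompts per user per day
const secondsPerDay = 86_400;

const promptsPerDay = dau * promptsPerUser;   // 200,000,000 inferences per day
const avgQps = promptsPerDay / secondsPerDay; // ~2,315 requests/sec on average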
Non-Functional Requirements
- High availability: chatbot should be usable 24/7
- High scalability: inference server should handle 10M DAU
- Low latency: response should be streamed as soon as generation starts
API Endpoints
GET /chat/history
Retrieve past chat history for the current user session.
Response Body:
{ chats: [ { id: string, user_message: string, assistant_response: string, timestamp: string } ] }
POST /chat
Submit a new chat prompt and receive a streamed or full response.
Request Body:
{ message: string, stream: boolean (optional, default = false) }
Response Body:
Non-streaming (stream=false):
{
message_id: string,
response: string
}
Streaming (stream=true):
data: {"content": "Hello"}
data: {"content": " world"}
data: [DONE]
POST /chat/store
Persist a completed conversation turn (user + assistant) into the database.
Request Body:
{ message_id: string, user_message: string, assistant_response: string, timestamp: string }
Response Body:
{ status: "success" }
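As a hypothetical usage sketch (inside an async function on the client), calling these endpoints might look like the following; the field names simply mirror the request and response bodies above:
// Load history on login (GET /chat/history).
const { chats } = await fetch('/chat/history').then((r) => r.json());

// Send a prompt without streaming (POST /chat with stream=false).
const { message_id, response } = await fetch('/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ message: 'Hello!', stream: false }),
}).then((r) => r.json());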
High Level Design
1. Login and Retrieve Chat History
Users should be able to retrieve and view their chat history upon login.
When the user logs in, the Client calls the App Server to fetch chat history. The App Server queries the Database using user_id, retrieves chat logs, and sends them back to the client to render.
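A minimal sketch of this flow on the App Server, assuming an Express-style app, a session that carries the user id, and a db.chats collection (all hypothetical):
// Hypothetical route: fetch the current user's chat history from the Database.
app.get('/chat/history', async (req, res) => {
  const userId = req.session.userId;          // established at login (assumed)
  const chats = await db.chats.find({ user_id: userId }, { sort: { timestamp: 1 } });
  res.json({ chats });                        // client renders the returned chat logs
});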
2. Send Chat and Store Response
Users send new prompts and receive real-time AI-generated responses.
The Client sends a new chat message to the App Server. It forwards the prompt to the Inference Server. The response from the LLM is returned to the App Server.
Afterward, the App Server requests the Database to store the full chat exchange (user + assistant message). Once confirmed, the App Server delivers the response to the Client.
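Under the same assumptions, the non-streaming version of this flow might look like the sketch below; callInference is a hypothetical wrapper around the Inference Server API:
// Hypothetical route: forward the prompt, persist the exchange, then reply.
app.post('/chat', async (req, res) => {
  const { message } = req.body;
  const response = await callInference(message);   // prompt -> Inference Server -> LLM response
  await db.chats.insert({                          // store user + assistant messages
    user_id: req.session.userId,
    user_message: message,
    assistant_response: response,
    timestamp: new Date().toISOString(),
  });
  res.json({ response });                          // deliver the response to the Client
});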
Deep Dive Questions
How to implement streaming? (Server and Frontend UX)
Streaming in ChatGPT is implemented using incremental token generation on the server and real-time rendering on the frontend. This improves perceived latency and mimics human-like typing behavior.
Server-Side Implementation:
- The LLM generates one token at a time.
- After each token, the inference server emits it immediately to the client using Server-Sent Events (SSE) or HTTP chunked responses.
- Streaming is triggered via a stream=true flag on the API request.
Example token stream:
data: {"choices":[{"delta":{"content":"The"}}]}
data: {"choices":[{"delta":{"content":" cat"}}]}
data: [DONE]
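A minimal server-side sketch under the same assumptions (Express-style handler; generateTokens is a hypothetical async iterator over tokens coming from the inference server). Note that EventSource can only issue GET requests, so a real deployment would either expose a GET variant of the endpoint, as the frontend snippet below assumes, or read the POST response body with fetch:
// Hypothetical SSE endpoint: relay each token to the client as it is generated.
app.get('/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  for await (const token of generateTokens(req.query.message)) {
    res.write(`data: ${JSON.stringify({ choices: [{ delta: { content: token } }] })}\n\n`);
  }
  res.write('data: [DONE]\n\n');                   // signal the end of the stream
  res.end();
});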
Frontend Implementation:
- The client listens to incoming token events via EventSource.
- Each token is appended to the chat box as it's received.
- A final [DONE] signal tells the UI to stop rendering.
Example code:
const source = new EventSource('/chat?stream=true');
source.onmessage = (e) => {
  // Stop listening once the final [DONE] marker arrives.
  if (e.data === '[DONE]') return source.close();
  const token = JSON.parse(e.data).choices[0].delta.content;
  if (token) appendToTextbox(token); // append each token as it streams in
};
How do we handle database writes during or after streaming?
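One common approach, sketched below under the same hypothetical assumptions, is to buffer tokens on the App Server while streaming and persist the completed turn only after the [DONE] marker is sent (e.g. via the POST /chat/store path above), so durability does not block token delivery:
// Hypothetical helper: stream tokens to the client, then persist the full turn.
async function streamAndPersist(userId, userMessage, inferenceStream, res) {
  let assistantResponse = '';
  for await (const token of inferenceStream) {
    assistantResponse += token;                                   // buffer in memory
    res.write(`data: ${JSON.stringify({ content: token })}\n\n`); // relay via SSE
  }
  res.write('data: [DONE]\n\n');
  res.end();
  await db.chats.insert({                                         // durability after streaming
    user_id: userId,
    user_message: userMessage,
    assistant_response: assistantResponse,
    timestamp: new Date().toISOString(),
  });
}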
Grasping the building blocks ("the lego pieces")
This part of the guide will focus on the various components that are often used to construct a system (the building blocks), and the design templates that provide a framework for structuring these blocks.
Core Building Blocks
At the bare minimum you should know the core building blocks of system design
- Scaling stateless services with load balancing
- Scaling database reads with replication and caching
- Scaling database writes with partitioning (aka sharding)
- Scaling data flow with message queues
System Design Template
With these building blocks, you will be able to apply our template to solve many system design problems. We will dive into the details in the Design Template section. Here's a sneak peek:
Additional Building Blocks
Additionally, you will want to understand these concepts
- Processing large amounts of data (aka "big data") with batch and stream processing
- Particularly useful for solving data-intensive problems such as designing an analytics app
- Achieving consistency across services using distributed transactions or event sourcing
- Particularly useful for solving problems that require strict transactions such as designing financial apps
- Full text search: full-text index
- Storing data for the long term: data warehousing
On top of these, there is ad hoc knowledge you will want to pick up for specific problems. For example, geohashing for designing location-based services like Yelp or Uber, or operational transforms for problems like designing Google Docs. You can learn these on a case-by-case basis. System design interviews are supposed to test your general design skills, not specific knowledge.
Working through problems and building solutions using the building blocks
Finally, we have a series of practical problems for you to work through. You can find the problems in /problems. This hands-on practice will not only help you apply the principles learned but will also enhance your understanding of how to use the building blocks to construct effective solutions. The list of questions grows as we actively add more.