
Design ChatGPT

Difficulty: easy

Description: Design ChatGPT

Introduction

ChatGPT is an AI assistant powered by large language models (LLMs) that can understand and generate human-like text. When a user interacts with ChatGPT, they provide a text prompt and receive a generated response. This process involves several key components and concepts:

Inference vs. Training

In this design problem, we are specifically focusing on the inference infrastructure - the system that allows users to interact with an already trained LLM. We are not designing the training infrastructure, which is a separate, computationally intensive process that creates the model.

[Diagram] Training (not in scope): dataset collection & curation, LLM architecture, model parameter optimization. Inference (our focus): user prompt processing, token generation & streaming, chat history management, deployment.

In other words, we are designing ChatGPT, the application that allows users to chat with an AI assistant.

Background

How Inference Works

Inference is the process of using a trained model to generate responses:

  1. User Input: The user sends a text prompt
  2. Tokenization: The prompt is converted into tokens (smaller text units)
  3. Model Processing: The LLM processes the tokens and predicts the next tokens
  4. Response Generation: These predicted tokens are assembled into a coherent response
  5. Streaming: Responses are typically streamed token-by-token for better user experience

LLM Inference Process

The inference process is computationally expensive, requiring specialized hardware like GPUs or TPUs. This is why most applications don't run LLMs locally but instead connect to remote inference servers.
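
To make these steps concrete, here is a minimal sketch of the autoregressive generation loop on an inference server. The tokenizer, model, and streamToClient objects are hypothetical placeholders rather than a real library API:

// Conceptual sketch of steps 2-5: tokenize, predict, stream, assemble
async function generateResponse(prompt, maxTokens = 512) {
  const inputTokens = tokenizer.encode(prompt);              // 2. Tokenization
  const outputTokens = [];

  for (let i = 0; i < maxTokens; i++) {
    // 3. Model processing: predict the next token from everything seen so far
    const next = await model.predictNextToken([...inputTokens, ...outputTokens]);
    if (next === tokenizer.EOS) break;                       // end-of-sequence token
    outputTokens.push(next);
    streamToClient(tokenizer.decode([next]));                // 5. Stream token-by-token
  }
  return tokenizer.decode(outputTokens);                     // 4. Assembled response
}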

System Design Focus

In this problem, we're designing the application infrastructure that:

  • Accepts user inputs
  • Sends them to inference servers
  • Receives and processes model outputs
  • Manages conversation history
  • Handles data persistence (i.e. storing the chat history)

We'll assume the existence of an inference server that provides the actual model capabilities, and focus on building the system around it.

Traffic Pattern

The traffic pattern of ChatGPT differs significantly from that of typical consumer applications. In contrast to platforms like Instagram—where content written by one user is read by many others—ChatGPT serves a strictly 1:1 interaction model between each user and the AI assistant.

Additionally, in real-world systems, it's common for traffic to be throttled or dropped when inference servers are overwhelmed. (You've probably seen the “Please try again later” message while chatting with overloaded AI services.)

Unique Load Profile:

  • 1:1 request-response flow: Each user message triggers one inference and one response.
  • Read-write ratio is effectively 1:1: Unlike typical social apps, there's no heavy read amplification.
  • No fan-out or content reuse: Responses are personalized and not shared or cached across users.
  • Latency-sensitive and compute-bound: System performance is primarily gated by model inference time.

This means we don't need to use our master system design template, which is optimized for read-heavy workloads and eventual consistency.

Instead, our focus should be on:

  • Handling inference backpressure
  • Ensuring data durability once streaming completes
  • Implementing tiered quality of service for premium users

Functional Requirements

Core Requirements

  1. Login and Retrieve Chat History: Users should be able to retrieve and view their chat history upon login.

  2. Send Chat and Store Response: Users send new prompts and receive real-time AI-generated responses.

Out of Scope

  • Multi-device sync
  • Multilingual support
  • Fine-grained access control
  • Plugin integrations

Scale Requirements

  • Support 10M daily active users (DAU)
  • Each user sends 20 prompts per day on average
  • System must support inference on each user prompt using an LLM
  • Inference results must be streamed to user interface
  • User messages and AI responses must be stored permanently
  • Inference latency target < 2 seconds under normal load
  • Read:write ratio of 1:1
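
As a rough back-of-envelope check on these numbers (assuming prompts are spread evenly across the day):

10M DAU × 20 prompts/day = 200M inferences/day
200,000,000 ÷ 86,400 seconds ≈ 2,300 inference requests per second on average (peak traffic will be several times higher)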

Non-Functional Requirements

  • High availability: chatbot should be usable 24/7
  • High scalability: inference server should handle 10M DAU
  • Low latency: response should be streamed as soon as generation starts

API Endpoints

GET /chat/history

Retrieve past chat history for the current user session.

Response Body:

{
  chats: [
    {
      id: string,
      user_message: string,
      assistant_response: string,
      timestamp: string
    }
  ]
}

POST /chat

Submit a new chat prompt and receive a streamed or full response.

Response Body:

Non-streaming (stream=false):

{ message_id: string, response: string }

Streaming (stream=true):

data: {"content": "Hello"}
data: {"content": " world"}
data: [DONE]
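
For illustration, a non-streaming call from the client might look like the sketch below. The request body fields (prompt, stream) and the renderAssistantMessage helper are assumptions based on the description above:

const res = await fetch('/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt: 'Explain SSE in one sentence', stream: false })
});
const { message_id, response } = await res.json();
renderAssistantMessage(message_id, response);   // hypothetical UI helper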

POST /chat/store

Persist a completed conversation turn (user + assistant) into the database.

Response Body:

{ status: "success" }

High Level Design

1. Login and Retrieve Chat History

Users should be able to retrieve and view their chat history upon login.

When the user logs in, the Client calls the App Server to fetch chat history. The App Server queries the Database using user_id, retrieves chat logs, and sends them back to the client to render.

retrieve chat history
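
A minimal sketch of this lookup on the App Server (Express-style; the chats table schema and the Postgres client are illustrative assumptions):

const express = require('express');
const { Pool } = require('pg');    // illustrative choice of database client

const app = express();
const pool = new Pool();           // connection settings omitted

// GET /chat/history: return the current user's past chats, oldest first
app.get('/chat/history', async (req, res) => {
  const { rows } = await pool.query(
    'SELECT id, user_message, assistant_response, timestamp FROM chats WHERE user_id = $1 ORDER BY timestamp ASC',
    [req.userId]                   // assumed to be set by auth middleware
  );
  res.json({ chats: rows });
});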

2. Send Chat and Store Response

Users send new prompts and receive real-time AI-generated responses.

The Client sends a new chat message to the App Server, which forwards the prompt to the Inference Server. The LLM's response is returned to the App Server.

Afterward, the App Server requests the Database to store the full chat exchange (user + assistant message). Once confirmed, the App Server delivers the response to the Client.

send chat and store response
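
A simplified, non-streaming sketch of this flow, reusing the Express app from the previous step (inferenceClient and saveChat are assumed helpers):

app.use(express.json());

// POST /chat: forward the prompt, persist the exchange, then respond
app.post('/chat', async (req, res) => {
  const { prompt } = req.body;

  // 1. Forward the prompt to the Inference Server and wait for the full response
  const assistantResponse = await inferenceClient.complete(prompt);          // hypothetical client

  // 2. Store the full exchange (user + assistant) before acknowledging
  const messageId = await saveChat(req.userId, prompt, assistantResponse);   // hypothetical helper

  // 3. Deliver the response to the Client
  res.json({ message_id: messageId, response: assistantResponse });
});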

Deep Dive Questions

How to implement streaming? (Server and Frontend UX)

Streaming in ChatGPT is implemented using incremental token generation on the server and real-time rendering on the frontend. This improves perceived latency and mimics human-like typing behavior.

Server-Side Implementation:

  • The LLM generates one token at a time.
  • After each token, the inference server emits it immediately to the client using Server-Sent Events (SSE) or HTTP chunked responses.
  • Streaming is triggered via a stream=true flag on the API request.

Example token stream:

data: {"choices":[{"delta":{"content":"The"}}]}
data: {"choices":[{"delta":{"content":" cat"}}]}
data: [DONE]
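
On the server side, a minimal SSE relay could look like the sketch below. The streamTokens async generator is an assumed wrapper around the inference server's output, and the handler uses GET because EventSource (used on the frontend) only issues GET requests:

// GET /chat?stream=true&prompt=... : relay tokens as Server-Sent Events
app.get('/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  for await (const token of inferenceClient.streamTokens(req.query.prompt)) {
    // OpenAI-style delta format, matching the example stream above
    res.write(`data: ${JSON.stringify({ choices: [{ delta: { content: token } }] })}\n\n`);
  }
  res.write('data: [DONE]\n\n');   // final signal for the UI
  res.end();
});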

Frontend Implementation:

  • The client listens to incoming token events via EventSource.
  • Each token is appended to the chat box as it's received.
  • A final [DONE] signal tells the UI to stop rendering.

Example code:

const source = new EventSource('/chat?stream=true');
source.onmessage = (e) => {
  if (e.data === '[DONE]') {   // final signal: stop rendering and close the stream
    source.close();
    return;
  }
  const token = JSON.parse(e.data).choices[0].delta.content;
  appendToTextbox(token);
};

How do we handle database writes during or after streaming?

Since streaming delivers responses incrementally, we must be careful about when to persist chat history.

  • Persist only after full message generation completes.
  • Once the [DONE] token is sent, the App Server writes the complete user message and assistant response to the database.

Why Not Write Midway?

  • Partial writes can corrupt chat history or create broken UX (e.g., cut-off answers).
  • Also, the LLM may backtrack during sampling (i.e., it may revise previously generated tokens while decoding), so intermediate tokens may not match the final result. Writing midway also multiplies the failure scenarios to handle, and there is little to gain since the user can simply retry on their own.

Optional Resilience to improve durability:

  • Use a temporary buffer or in-memory queue for in-progress messages.
  • Mark DB entries with a status: in_progress, updated to completed post-generation.
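
A sketch combining these two ideas, buffering tokens in memory and flipping a status flag once generation completes (insertChat, updateChat, and streamTokens are illustrative helpers):

// Stream to the client, but persist the assistant response only after [DONE]
async function handleStreamingChat(userId, prompt, res) {
  const chatId = await insertChat({ userId, prompt, status: 'in_progress' });  // durability marker
  const tokens = [];                                                           // in-memory buffer

  for await (const token of inferenceClient.streamTokens(prompt)) {
    tokens.push(token);
    res.write(`data: ${JSON.stringify({ content: token })}\n\n`);
  }
  res.write('data: [DONE]\n\n');
  res.end();

  // Generation is complete: write the final response and mark the row completed
  await updateChat(chatId, { assistant_response: tokens.join(''), status: 'completed' });
}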

What happens when inference servers are overwhelmed?

When inference traffic spikes, it's impossible to serve all requests. There are two possible user experiences:

  1. Everyone receives slower responses (high latency or timeouts).
  2. Some users are temporarily denied service and asked to retry later, while others are served at the normal rate.

In practice, Option 2 is the near-universal choice, because Option 1 degrades the experience for all users and often leads to server collapse.

Therefore, the frontend is designed to display a clear message like:

“Our AI servers are busy. Please try again shortly.”

This ensures fast feedback and retains user trust.

How should the system protect the inference servers under high traffic?

A reverse proxy or load balancer (e.g., Nginx, Envoy) is deployed in front of inference servers.

reverse proxy

It performs two key roles:

  • Maintains heartbeat polling with each inference server to track memory load or queue depth.
  • Routes only to servers under safe load thresholds.

When no servers are available:

  • The proxy returns a 503 Service Unavailable response.
  • The App Server relays this to the frontend.
  • The frontend retries using exponential backoff: wait 10s, 20s, 40s, etc.

This protects the cluster from cascading failure and ensures predictable behavior under stress.
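
A highly simplified sketch of the proxy's routing decision (production proxies such as Nginx or Envoy do this via health checks and load reporting; the thresholds and helpers here are assumptions):

const MAX_QUEUE_DEPTH = 32;             // illustrative safe-load thresholds
const MAX_MEMORY_UTILIZATION = 0.85;

// servers: [{ url, queueDepth, memoryUtilization }], refreshed by heartbeat polling
function routeRequest(servers, request) {
  const healthy = servers.filter(
    (s) => s.queueDepth < MAX_QUEUE_DEPTH && s.memoryUtilization < MAX_MEMORY_UTILIZATION
  );
  if (healthy.length === 0) {
    return { status: 503, body: 'Service Unavailable' };    // relayed to the frontend
  }
  const target = healthy.reduce((a, b) => (a.queueDepth <= b.queueDepth ? a : b));  // least loaded
  return forwardTo(target.url, request);                    // hypothetical forwarding helper
}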

How to implement tiered quality of service so that premium users get priority when the system is under heavy load?

The goals are to:

  • ensure premium users are always served, and
  • maximize overall inference utilization.

Static Reservation

One simple idea is to statically reserve a portion of inference servers for premium traffic. However:

  • During overload, free users are denied even if the reserved premium capacity sits idle. This wastes resources and degrades the experience for free-tier users.
  • It's difficult to size the reservation correctly: premium traffic fluctuates with both the number of premium users and their usage patterns (e.g., spikes around major events), so estimation is non-trivial and the reservation tends to be over-provisioned, wasting even more capacity.

Load-Based Filtering

A better approach is to treat the entire inference fleet as a shared pool and implement load-based filtering:

tiered qos

  • When system load exceeds a critical threshold (e.g., GPU memory pressure or queue depth):

    • Free-tier traffic is rejected immediately.
    • Premium traffic is still admitted, as long as the server isn't fully saturated.
  • Once load returns below the threshold:

    • Free-tier traffic is re-admitted automatically.
    • No servers sit idle during this time — all resources are dynamically shared.

This approach achieves the same effect as reservation, but without idle compute. This system works because:

  • Most load comes from free-tier traffic, so denying just that class results in a rapid load drop.
  • Since inference requests are short (e.g., 1–2 seconds), even a full stop on free-tier requests clears queues quickly.
  • Premium traffic, being smaller in volume, can safely continue without tipping the system into overload.
  • No need to forecast premium traffic or maintain idle capacity — the system self-balances.
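
In code, the admission check at the front of the shared pool might look like this sketch (the threshold values and the normalized load signal are assumptions):

const CRITICAL_LOAD = 0.80;    // above this, shed free-tier traffic
const SATURATION_LOAD = 0.95;  // above this, shed all traffic

// currentLoad: normalized 0-1 signal (e.g., GPU memory pressure or queue depth)
function admit(request, currentLoad) {
  if (currentLoad >= SATURATION_LOAD) return false;                     // fully saturated
  if (currentLoad >= CRITICAL_LOAD) return request.tier === 'premium';  // premium only
  return true;                                                          // normal load: admit everyone
}

Requests rejected here receive the 503 response and the retry flow described earlier.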

Retry strategies

  • Premium requests use linear backoff (e.g., retry every 10s).
  • Free-tier requests use exponential backoff (e.g., 10s → 20s → 40s…).

retry strategies

This ensures premium users retry faster and reclaim capacity early, while free users back off to reduce contention.
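
For illustration, the client could compute its next retry delay per tier as in this sketch (10-second base interval, as in the examples above):

const BASE_DELAY_MS = 10000;   // 10 seconds

// Premium: retry at a steady 10s interval; free tier: 10s, 20s, 40s, ...
function nextRetryDelay(tier, attempt) {   // attempt is 0-based
  return tier === 'premium'
    ? BASE_DELAY_MS
    : BASE_DELAY_MS * 2 ** attempt;
}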

Let's look at two edge cases with overload:

tiered qos edge cases

Case 1: No Premium Traffic

  • During overload, free-tier traffic is denied.
  • Server load drops quickly.
  • After the "cool-down" period (e.g., 2 minutes), free-tier traffic resumes normally.
  • Resources are never idle for long.

Case 2: Heavy Premium Load

  • Free users are blocked entirely.
  • Servers focus solely on premium users.
  • Load drops due to free-tier denial, but premium traffic immediately fills the gap.
  • Free traffic remains throttled until premium demand is satisfied.

This design achieves both goals:

  • Premium users are almost always served unless all capacity is completely exhausted.
  • Inference resources are never wasted, because unused premium capacity is made available to free users automatically.
