Design ChatGPT

Difficulty: easy

Description: Design ChatGPT

Introduction

ChatGPT is an AI assistant powered by large language models (LLMs) that can understand and generate human-like text. When a user interacts with ChatGPT, they provide a text prompt and receive a generated response. This process involves several key components and concepts:

Inference vs. Training

In this design problem, we are specifically focusing on the inference infrastructure - the system that allows users to interact with an already trained LLM. We are not designing the training infrastructure, which is a separate, computationally intensive process that creates the model.

  • Training (Not in Scope): Dataset Collection & Curation, LLM Architecture, Training, Model Parameter Optimization
  • Inference (Our Focus): User Prompt Processing, Token Generation & Streaming, Chat History Management, Deployment

In other words, we are designing ChatGPT, the application that allows users to chat with an AI assistant.

Background

How Inference Works

Inference is the process of using a trained model to generate responses:

  1. User Input: The user sends a text prompt
  2. Tokenization: The prompt is converted into tokens (smaller text units)
  3. Model Processing: The LLM processes the tokens and predicts the next tokens
  4. Response Generation: These predicted tokens are assembled into a coherent response
  5. Streaming: Responses are typically streamed token-by-token for better user experience

LLM Inference Process

The inference process is computationally expensive, requiring specialized hardware like GPUs or TPUs. This is why most applications don't run LLMs locally but instead connect to remote inference servers.
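To make these steps concrete, here is a minimal, illustrative sketch of the generation loop. The tokenize and predictNextToken functions are hypothetical stubs standing in for a real tokenizer and model, not an actual LLM implementation:

type Token = string;

// Hypothetical stubs so the sketch is self-contained; a real system would call
// a tokenizer library and a model hosted on GPU/TPU inference servers.
const tokenize = (text: string): Token[] => text.split(/\s+/);
const predictNextToken = async (_context: Token[]): Promise<Token> => '<eos>';

async function generate(
  prompt: string,
  onToken: (t: Token) => void,   // invoked per token, enabling streaming (step 5)
  maxTokens = 256
): Promise<string> {
  const context = tokenize(prompt);          // steps 1-2: user input is tokenized
  const output: Token[] = [];
  for (let i = 0; i < maxTokens; i++) {
    const next = await predictNextToken([...context, ...output]); // step 3: model predicts the next token
    if (next === '<eos>') break;             // model signals the end of the response
    output.push(next);                       // step 4: assemble the response
    onToken(next);                           // step 5: stream the token immediately
  }
  return output.join(' ');
}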

System Design Focus

In this problem, we're designing the application infrastructure that:

  • Accepts user inputs
  • Sends them to inference servers
  • Receives and processes model outputs
  • Manages conversation history
  • Handles data persistence (i.e. storing the chat history)

We'll assume the existence of an inference server that provides the actual model capabilities, and focus on building the system around it.

Traffic Pattern

The traffic pattern of ChatGPT differs significantly from that of typical consumer applications. In contrast to platforms like Instagram—where content written by one user is read by many others—ChatGPT serves a strictly 1:1 interaction model between each user and the AI assistant.

Additionally, in real-world systems, it's common for traffic to be throttled or dropped when inference servers are overwhelmed. (You've probably seen the “Please try again later” message while chatting with overloaded AI services.)

Unique Load Profile:

  • 1:1 request-response flow: Each user message triggers one inference and one response.
  • Read-write ratio is effectively 1:1: Unlike typical social apps, there's no heavy read amplification.
  • No fan-out or content reuse: Responses are personalized and not shared or cached across users.
  • Latency-sensitive and compute-bound: System performance is primarily gated by model inference time.

This means we don't need to use our master system design template, which is optimized for read-heavy, eventually consistent workloads.

Instead, our focus should be on:

  • Handling inference backpressure (a minimal sketch follows this list)
  • Ensuring data durability once streaming completes
  • Implementing tiered quality of service for premium users
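For backpressure in particular, a common pattern is to cap the number of in-flight inference requests and shed anything beyond the cap so clients can retry later. A minimal sketch, where the cap and error message are illustrative assumptions rather than prescribed values:

// Cap concurrent inference calls; excess load is rejected rather than queued
// indefinitely. MAX_IN_FLIGHT is an assumed, illustrative value.
const MAX_IN_FLIGHT = 1000;
let inFlight = 0;

async function withBackpressure<T>(runInference: () => Promise<T>): Promise<T> {
  if (inFlight >= MAX_IN_FLIGHT) {
    // Callers translate this into "Please try again later" plus client-side backoff.
    throw new Error('Inference capacity exceeded; try again later');
  }
  inFlight++;
  try {
    return await runInference();
  } finally {
    inFlight--;
  }
}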

Functional Requirements

Core Requirements

  1. Login and Retrieve Chat History: Users should be able to retrieve and view their chat history upon login.

  2. Send Chat and Store Response: Users send new prompts and receive real-time AI-generated responses.

Out of Scope

  • Multi-device sync
  • Multilingual support
  • Fine-grained access control
  • Plugin integrations

Scale Requirements

  • Support 10M daily active users (DAU), many of them active concurrently
  • Each user sends 20 prompts per day on average
  • System must support inference on each user prompt using an LLM
  • Inference results must be streamed to user interface
  • User messages and AI responses must be stored permanently
  • Inference latency target < 2 seconds under normal load
  • Read:write ratio of 1:1
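From these numbers, a rough back-of-envelope estimate of the load (the peak multiplier is an assumption):

10M DAU x 20 prompts/day = 200M prompts/day
200M prompts / 86,400 seconds ≈ 2,300 inference requests per second on average
Assuming peaks of roughly 3-5x the average, plan for about 7,000-12,000 requests per second at peak

Since the read:write ratio is roughly 1:1, chat-history storage sees a similar average write rate of about 2,300 turns per second.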

Non-Functional Requirements

  • High availability: chatbot should be usable 24/7
  • High scalability: inference server should handle 10M DAU
  • Low latency: response should be streamed as soon as generation starts

API Endpoints

GET /chat/history

Retrieve past chat history for the current user session.

Response Body:

{
  chats: [
    {
      id: string,
      user_message: string,
      assistant_response: string,
      timestamp: string
    }
  ]
}
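As an illustrative sketch, the response can be typed and fetched on the client like this (field names mirror the schema above; session and auth handling are omitted):

interface ChatTurn {
  id: string;
  user_message: string;
  assistant_response: string;
  timestamp: string;
}

async function loadHistory(): Promise<ChatTurn[]> {
  const res = await fetch('/chat/history');
  const body: { chats: ChatTurn[] } = await res.json();
  return body.chats;
}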

POST /chat

Submit a new chat prompt and receive a streamed or full response.

Request Body:

{ message: string, stream: boolean (optional, default = false) }

Response Body:

Non-streaming (stream=false):

{ message_id: string, response: string }

Streaming (stream=true):

data: {"content": "Hello"}
data: {"content": " world"}
data: [DONE]
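For the non-streaming case, a minimal client call might look like the sketch below (error handling omitted; the streaming case is covered in the deep dive later):

async function sendChat(message: string): Promise<string> {
  const res = await fetch('/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message, stream: false }),
  });
  const body: { message_id: string; response: string } = await res.json();
  return body.response;
}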

POST /chat/store

Persist a completed conversation turn (user + assistant) into the database.

Request Body:

{
  message_id: string,
  user_message: string,
  assistant_response: string,
  timestamp: string
}

Response Body:

{ status: "success" }
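A sketch of the store call issued once the full assistant response has been assembled (the ISO timestamp format is an assumption):

// Persist a completed turn after streaming finishes; illustrative only.
async function storeChatTurn(turn: {
  message_id: string;
  user_message: string;
  assistant_response: string;
}): Promise<void> {
  await fetch('/chat/store', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ ...turn, timestamp: new Date().toISOString() }),
  });
}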

High Level Design

1. Login and Retrieve Chat History

Users should be able to retrieve and view their chat history upon login.

When the user logs in, the Client calls the App Server to fetch chat history. The App Server queries the Database using user_id, retrieves chat logs, and sends them back to the client to render.

retrieve chat history
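A minimal App Server handler sketch for this flow, using Express; the db.query helper and table schema are assumed stand-ins for a real database client:

import express from 'express';

const app = express();

// Stub standing in for a real database client (e.g. a SQL driver).
const db = {
  query: async (_sql: string, _params: unknown[]): Promise<unknown[]> => [],
};

app.get('/chat/history', async (req, res) => {
  // In a real system, user_id comes from the authenticated session.
  const userId = req.query.user_id as string;
  const chats = await db.query(
    'SELECT id, user_message, assistant_response, timestamp FROM chats WHERE user_id = $1 ORDER BY timestamp',
    [userId]
  );
  res.json({ chats });
});

app.listen(3000);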

2. Send Chat and Store Response

Users send new prompts and receive real-time AI-generated responses.

The Client sends a new chat message to the App Server, which forwards the prompt to the Inference Server. The response from the LLM is returned to the App Server.

Afterward, the App Server asks the Database to store the full chat exchange (user message + assistant response). Once the write is confirmed, the App Server delivers the response to the Client.

send chat and store response
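A sketch of that orchestration on the App Server (the inference server URL and response shape are assumptions; the streaming variant is covered in the deep dive below):

// Forward the prompt, persist the exchange, then return the response.
// Error handling, auth, and retries are omitted for brevity.
async function handleChatTurn(userId: string, message: string): Promise<string> {
  // 1. Forward the prompt to the Inference Server (URL and payload shape assumed).
  const inference = await fetch('http://inference-server/v1/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: message }),
  });
  const { response } = (await inference.json()) as { response: string };

  // 2. Persist the full exchange before acknowledging to the client.
  await saveChat(userId, message, response);

  // 3. Deliver the assistant response back to the Client.
  return response;
}

// Hypothetical persistence helper; a real implementation writes to the Database.
async function saveChat(_userId: string, _userMsg: string, _assistantMsg: string): Promise<void> {}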

Deep Dive Questions

How to implement streaming? (Server and Frontend UX)

Streaming in ChatGPT is implemented using incremental token generation on the server and real-time rendering on the frontend. This improves perceived latency and mimics human-like typing behavior.

Server-Side Implementation:

  • The LLM generates one token at a time.
  • After each token, the inference server emits it immediately to the client using Server-Sent Events (SSE) or HTTP chunked responses.
  • Streaming is triggered via a stream=true flag on the API request.

Example token stream:

data: {"choices":[{"delta":{"content":"The"}}]}
data: {"choices":[{"delta":{"content":" cat"}}]}
data: [DONE]
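A minimal server-side sketch with Express and SSE, emitting the simpler {"content": ...} event shape shown in the API section; the token generator is a hypothetical stand-in for the real inference client:

import express from 'express';

const app = express();
app.use(express.json());

// Hypothetical async generator standing in for the inference server client.
async function* generateTokens(_prompt: string): AsyncGenerator<string> {
  for (const token of ['The', ' cat', ' sat.']) yield token;
}

app.post('/chat', async (req, res) => {
  // Assumes stream=true; the non-streaming path would buffer tokens and call res.json() instead.
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  for await (const token of generateTokens(req.body.message)) {
    res.write(`data: ${JSON.stringify({ content: token })}\n\n`);
  }
  res.write('data: [DONE]\n\n');
  res.end();
});

app.listen(3000);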

Frontend Implementation:

  • The client listens to incoming token events via EventSource.
  • Each token is appended to the chat box as it's received.
  • A final [DONE] signal tells the UI to stop rendering.

Example code:

const source = new EventSource('/chat?stream=true');

source.onmessage = (e) => {
  // Stop listening when the server signals the end of the stream
  if (e.data === '[DONE]') {
    source.close();
    return;
  }
  // Each event carries one token delta; append it to the chat box as it arrives
  const token = JSON.parse(e.data).choices[0].delta.content;
  appendToTextbox(token);
};

How do we handle database writes during or after streaming?

Grasping the building blocks ("the lego pieces")

This part of the guide will focus on the various components that are often used to construct a system (the building blocks), and the design templates that provide a framework for structuring these blocks.

Core Building blocks

At a bare minimum, you should know the core building blocks of system design:

  • Scaling stateless services with load balancing
  • Scaling database reads with replication and caching
  • Scaling database writes with partitioning (aka sharding)
  • Scaling data flow with message queues

System Design Template

With these building blocks, you will be able to apply our template to solve many system design problems. We will dive into the details in the Design Template section. Here's a sneak peek:

System Design Template

Additional Building Blocks

Additionally, you will want to understand these concepts:

  • Processing large amounts of data (aka "big data") with batch and stream processing
    • Particularly useful for data-intensive problems such as designing an analytics app
  • Achieving consistency across services using distributed transactions or event sourcing
    • Particularly useful for problems that require strict transactions, such as designing financial apps
  • Full-text search: full-text index
  • Storing data for the long term: data warehousing

On top of these, there is ad hoc knowledge you will want to pick up for certain problems: for example, geohashing for location-based services like Yelp or Uber, or operational transforms for collaborative editing problems like designing Google Docs. You can learn these on a case-by-case basis. System design interviews are meant to test your general design skills, not specific knowledge.

Working through problems and building solutions using the building blocks

Finally, we have a series of practical problems for you to work through. You can find them in /problems. This hands-on practice will not only help you apply the principles learned but will also enhance your understanding of how to use the building blocks to construct effective solutions. The list of questions is growing; we are actively adding more.
