Unique ID Generators

When designing distributed systems, one of the first challenges you'll encounter is generating unique identifiers for your data. Every tweet needs an ID, every YouTube video needs an ID, every shortened URL needs an ID - but generating unique IDs across multiple servers without coordination is harder than it seems.

This article explores the main approaches to unique ID generation, focusing on their trade-offs and when to use each approach in system design interviews. Understanding these patterns is essential for problems like Design Twitter, Design YouTube, URL Shortener, and Design Pastebin.

The Unique ID Problem in Distributed Systems

In a single-server application, generating unique IDs is simple - you can use an auto-incrementing database sequence. But distributed systems break this approach:

The Problem:

Server 1: Creates user with ID 1, 2, 3...
Server 2: Creates user with ID 1, 2, 3...  ← Collision!
Server 3: Creates user with ID 1, 2, 3...  ← More collisions!
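
The collision above is easy to reproduce. The following is a minimal sketch, assuming three servers that each keep a private counter starting at 1; the server names are made up for illustration.

```python
from itertools import count

# Each server keeps its own local auto-increment counter (no coordination).
servers = {name: count(start=1) for name in ("server-1", "server-2", "server-3")}

# Every server hands out its first three IDs.
issued = {name: [next(counter) for _ in range(3)] for name, counter in servers.items()}

all_ids = [i for ids in issued.values() for i in ids]
duplicates = len(all_ids) - len(set(all_ids))

print(issued)
print(f"{len(all_ids)} IDs issued, {duplicates} of them duplicates")
```

Only three of the nine issued IDs are distinct, which is why independent counters cannot be used as-is.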

Why Auto-Increment Fails:

Coordination overhead: Servers must synchronize to agree on the next ID
Single point of failure: If the centralized ID generator goes down, no server can create new records
Performance impact: Every ID costs a network round trip, and the central generator becomes a bottleneck under load

Requirements for Distributed ID Generation:

Uniqueness: No two IDs should be identical across the entire system
High availability: ID generation shouldn't fail when individual servers go down
Low latency: ID generation should be fast (< 1ms)
Scalability: Should work with thousands of servers generating millions of IDs
Sortability (often desired): Newer IDs should be greater than older ones

Snowflake IDs: Twitter's Solution

Snowflake solves the distributed ID problem by embedding uniqueness into the ID structure itself. Instead of coordinating between servers, each server generates IDs independently using a clever bit allocation strategy.

How Snowflake Ensures Uniqueness

The Key Insight: If every server has a unique identifier and combines it with a timestamp and a per-millisecond counter, IDs will be unique across the entire system without runtime coordination.

Snowflake ID Structure (64 bits):

| 1 bit | 41 bits        | 10 bits    | 12 bits  |
| Sign  | Timestamp      | Machine ID | Sequence |
| (0)   | (milliseconds) | (0-1023)   | (0-4095) |

Structure of a Snowflake ID

How Each Part Ensures Uniqueness:

Timestamp (41 bits): Milliseconds since custom epoch
- Different times = different IDs
- Provides ~69 years of timestamps (2^41 milliseconds)
- Makes IDs sortable by creation time
Machine ID (10 bits): Unique identifier per server
- Supports up to 1,024 different servers
- Each server gets a unique ID during deployment
- Different machines = different IDs (even at same time)
- Coordination required: Must ensure no two servers have the same machine ID
Sequence Number (12 bits): Counter within the same millisecond
- Handles multiple ID requests in the same millisecond on the same machine
- Supports up to 4,096 IDs per millisecond per machine
- Resets to 0 each new millisecond
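
The capacity figures in this list follow directly from the bit widths. A quick check, using only the widths given above (the constant names are just for this sketch):

```python
SIGN_BITS = 1
TIMESTAMP_BITS = 41
MACHINE_ID_BITS = 10
SEQUENCE_BITS = 12

total_bits = SIGN_BITS + TIMESTAMP_BITS + MACHINE_ID_BITS + SEQUENCE_BITS

# Milliseconds in an average year (365.25 days).
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

timestamp_years = (2 ** TIMESTAMP_BITS) / MS_PER_YEAR  # lifetime of the timestamp field
max_machines = 2 ** MACHINE_ID_BITS                    # distinct machine IDs
ids_per_ms = 2 ** SEQUENCE_BITS                        # IDs per millisecond per machine
ids_per_second = ids_per_ms * 1000                     # theoretical ceiling per machine

print(f"total bits:        {total_bits}")
print(f"timestamp range:   {timestamp_years:.1f} years")
print(f"machines:          {max_machines}")
print(f"IDs/ms/machine:    {ids_per_ms}")
print(f"IDs/sec/machine:   {ids_per_second}")
```

The per-second figure is a ceiling set by the sequence field; real throughput also depends on how fast the generator code itself runs.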

Example Generation:

Time: 1640995200000 (some millisecond)
Server 1, Machine ID 001: 1640995200000-001-0000  (first ID in this ms)
Server 1, Machine ID 001: 1640995200000-001-0001  (same ms, increment sequence)
Server 2, Machine ID 002: 1640995200000-002-0000  (same ms, different machine)
Next millisecond:        1640995200001-001-0000  (sequence resets to 0)
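
The dash-separated notation above is only a way to read the fields. In practice the three parts are packed into one 64-bit integer with shifts. Here is a sketch of that packing, assuming a hypothetical custom epoch of 2020-01-01 UTC (`EPOCH_MS` is an illustrative value, not Twitter's) and the helper names `compose` and `decompose`:

```python
EPOCH_MS = 1577836800000  # hypothetical custom epoch (2020-01-01 UTC)
MACHINE_ID_BITS = 10
SEQUENCE_BITS = 12


def compose(timestamp_ms, machine_id, sequence):
    """Pack the three fields into one integer: timestamp | machine | sequence."""
    return (
        (timestamp_ms << (MACHINE_ID_BITS + SEQUENCE_BITS))
        | (machine_id << SEQUENCE_BITS)
        | sequence
    )


def decompose(snowflake_id):
    """Recover (timestamp_ms, machine_id, sequence) from a packed ID."""
    sequence = snowflake_id & ((1 << SEQUENCE_BITS) - 1)
    machine_id = (snowflake_id >> SEQUENCE_BITS) & ((1 << MACHINE_ID_BITS) - 1)
    timestamp_ms = snowflake_id >> (MACHINE_ID_BITS + SEQUENCE_BITS)
    return timestamp_ms, machine_id, sequence


t = 1640995200000 - EPOCH_MS  # milliseconds since the custom epoch

first = compose(t, machine_id=1, sequence=0)
second = compose(t, machine_id=1, sequence=1)      # same ms, next sequence
other = compose(t, machine_id=2, sequence=0)       # same ms, different machine
later = compose(t + 1, machine_id=1, sequence=0)   # next ms, sequence reset

print(first, second, other, later)
print(decompose(first))
```

Because the timestamp occupies the highest bits, comparing two IDs as plain integers compares their creation times first.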

Snowflake Implementation Example

Here's an interactive implementation showing the key concepts:

Snowflake ID Generator

Interactive implementation showing distributed unique ID generation with Twitter-like example
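
As a static reference, here is a minimal sketch of a Snowflake-style generator in Python. It is an illustration under stated assumptions, not Twitter's implementation: `EPOCH_MS` is a hypothetical custom epoch, the class and method names are made up, and the clock is injectable only so the behavior can be tested.

```python
import threading
import time

EPOCH_MS = 1577836800000  # hypothetical custom epoch (2020-01-01 UTC)
MACHINE_ID_BITS = 10
SEQUENCE_BITS = 12
MAX_MACHINE_ID = (1 << MACHINE_ID_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


def _now_ms():
    """Current wall-clock time in milliseconds."""
    return int(time.time() * 1000)


class SnowflakeGenerator:
    def __init__(self, machine_id, clock=_now_ms):
        if not 0 <= machine_id <= MAX_MACHINE_ID:
            raise ValueError(f"machine_id must be in 0..{MAX_MACHINE_ID}")
        self.machine_id = machine_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # A backwards clock could repeat an old timestamp, so refuse.
                raise RuntimeError("clock moved backwards; refusing to generate IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4,096 values used in this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0  # new millisecond, counter restarts
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (MACHINE_ID_BITS + SEQUENCE_BITS))
                | (self.machine_id << SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeGenerator(machine_id=1)
sample_ids = [generator.next_id() for _ in range(5)]
print(sample_ids)
```

Raising an error on a backwards clock is one possible policy; other designs wait until the clock catches up. Either way, the generator never reuses a timestamp it has already issued IDs for.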


Real-World Snowflake Trade-offs

Trade-offs:

✅ Advantages:
- No coordination needed (each server generates IDs independently)
- High performance (a ceiling of 4,096 IDs per millisecond, roughly 4 million per second, per machine)
- Time-sortable (newer tweets have larger IDs)
- Roughly sequential (good for database B-tree performance)

❌ Limitations:
- Clock dependency (requires synchronized clocks across servers)
- Limited scale (only 1,024 machines, 4,096 IDs/ms per machine)
- Clock skew problems (duplicate IDs if server clock goes backwards)
- Machine ID coordination (must ensure each server has unique machine ID)

Machine ID Coordination Explained:

The Problem: Every server needs a unique machine ID to prevent collisions.

Bad scenario (no coordination):
Server A starts up → picks machine ID 5
Server B starts up → picks machine ID 5  ← Problem! Same ID!
Both generate IDs in the same millisecond → Collision guaranteed

Common coordination strategies:

Configuration Management:

# server-config.yml
server_1:
  machine_id: 1
server_2:
  machine_id: 2
server_3:
  machine_id: 3

Service Discovery + Database:

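# Server startup process (pseudocode: db and server_hostname come from the service)
def get_machine_id():
    # Reuse the ID previously assigned to this host, if any.
    # Compare against None so that machine ID 0 is not treated as missing.
    existing_id = db.get("machine_id", server_hostname)
    if existing_id is not None:
        return existing_id

    # Otherwise allocate the next free ID and remember it for this host
    next_id = db.increment("next_available_machine_id")
    db.set("machine_id", server_hostname, next_id)
    return next_id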

Container Orchestration:

# Kubernetes assigns unique IDs via environment variables
docker run -e MACHINE_ID=${POD_ID} snowflake-service

Why this is "coordination": Unlike UUID (completely independent), Snowflake requires a setup step where servers agree on who gets which machine ID.
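
On the service side, the machine ID still has to be read and validated at startup. The sketch below assumes the `MACHINE_ID` environment variable from the container example, plus a fallback that is purely an assumption of this sketch: taking the trailing number from a hostname such as `snowflake-3`, the naming pattern Kubernetes StatefulSets give their pods. The function name `resolve_machine_id` is hypothetical.

```python
import os
import re

MAX_MACHINE_ID = 1023  # 10-bit machine ID field


def resolve_machine_id(environ=os.environ, hostname=None):
    """Return the machine ID from MACHINE_ID, or from a trailing hostname ordinal."""
    raw = environ.get("MACHINE_ID")
    if raw is None and hostname is not None:
        # Assumed fallback: hostnames ending in "-<number>", e.g. "snowflake-3".
        match = re.search(r"-(\d+)$", hostname)
        if match:
            raw = match.group(1)
    if raw is None:
        raise RuntimeError("no machine ID configured")
    machine_id = int(raw)
    if not 0 <= machine_id <= MAX_MACHINE_ID:
        raise ValueError(f"machine ID {machine_id} does not fit in 10 bits")
    return machine_id


print(resolve_machine_id({"MACHINE_ID": "7"}))
print(resolve_machine_id({}, hostname="snowflake-3"))
```

Failing fast on a missing or out-of-range value matters here, because a server that silently falls back to a default ID would reintroduce the collision scenario described above.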

When to Use Snowflake:

Need sortable IDs (like Twitter timeline)
High-throughput systems (millions of IDs per second)
Distributed systems with known number of servers
Applications where ID structure can be exposed (IDs reveal timestamp)

Alternative Unique ID Strategies

Snowflake isn't the only solution for distributed ID generation. Different approaches have different trade-offs that make them better suited for specific use cases.

UUID (Universally Unique Identifier)

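How it works: 128-bit identifiers. Version 4 fills 122 of those bits with random or pseudo-random data (the remaining 6 bits mark the version and variant), so the collision probability is extremely low.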

UUID v4 Example: f47ac10b-58cc-4372-a567-0e02b2c3d479
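
In Python, version 4 UUIDs come from the standard library's `uuid` module, with no setup and no shared state. A short example that also parses the sample value above:

```python
import uuid

# Generate a fresh random (version 4) UUID.
new_id = uuid.uuid4()
print(new_id)

# Parse the example from the text and inspect its fields.
example = uuid.UUID("f47ac10b-58cc-4372-a567-0e02b2c3d479")
print(example.version)     # version field encoded in the ID
print(len(example.bytes))  # size in bytes
```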

Trade-offs:

✅ Advantages:
- Truly distributed (no coordination needed)
- No clock dependency
- Standardized across programming languages
- Can generate offline

❌ Disadvantages:
- Large size (16 bytes vs 8 bytes for Snowflake)
- Not sortable by time
- Poor database performance (random inserts)
- Not human-readable

When to use: Distributed systems where you can't coordinate machine IDs, offline applications, systems that don't need time-based sorting.

Why UUID hurts write performance: UUID v4 values are effectively random, so consecutive inserts land at unrelated positions in the index. That causes problems for database B-tree indexes:

Sequential inserts (Snowflake/Auto-increment):
[Page 1: 1,2,3,4,5] → [Page 2: 6,7,8,9,10] → Always insert at end ✅

Random inserts (UUID):
[Page 1: uuid-a, uuid-m, uuid-z] → Inserting uuid-f requires page split ❌

Random inserts require the database to:

Find the correct position in the B-tree (more disk reads)
Split pages when inserting in the middle (expensive operation)
Create index fragmentation (reduces cache efficiency)

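Performance impact: Sequential inserts are often cited as 5-10x faster than random UUID inserts in write-heavy systems, though the actual gap depends on the database engine, the index size, and how much of the index fits in memory.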
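
The difference in insert position can be seen with a toy model. The sketch below uses a sorted Python list as a stand-in for an index and counts how many inserts land somewhere other than the end. It is not a B-tree and does not measure the speedup; it only shows where the keys go. The function name and the key counts are arbitrary choices for this sketch.

```python
import bisect
import random


def count_mid_inserts(keys):
    """Count keys that must be inserted before the current end of a sorted index."""
    index = []
    mid_inserts = 0
    for key in keys:
        pos = bisect.bisect_left(index, key)
        if pos < len(index):
            mid_inserts += 1  # lands in the middle: a candidate for a page split
        index.insert(pos, key)
    return mid_inserts


rng = random.Random(42)
sequential_keys = list(range(10_000))                        # Snowflake/auto-increment style
random_keys = [rng.getrandbits(122) for _ in range(10_000)]  # UUID v4 style

sequential_mid = count_mid_inserts(sequential_keys)
random_mid = count_mid_inserts(random_keys)

print(f"sequential keys: {sequential_mid} mid-index inserts")
print(f"random keys:     {random_mid} mid-index inserts")
```

Sequential keys always append, while almost every random key lands in the middle of the existing data.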

For a deeper understanding of why sequential writes perform better than random writes in databases, see Data Structures Behind Databases.

ULID (Universally Unique Lexicographically Sortable Identifier)

How it works: 128-bit IDs made of a 48-bit millisecond timestamp followed by 80 bits of randomness, encoded as 26 characters of Crockford's Base32.

ULID Example: 01F8VYXK67BGC1XPD2YH1W8HTH
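
The layout is simple enough to sketch by hand. The code below follows the 48-bit timestamp plus 80-bit randomness structure described above; the function names are made up for this sketch, and a production system would more likely use an existing ULID library.

```python
import secrets
import time

# Crockford's Base32 alphabet: digits and uppercase letters without I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms, randomness):
    """Encode a 48-bit timestamp and 80 random bits as 26 Base32 characters."""
    value = (timestamp_ms << 80) | randomness
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])  # take 5 bits at a time
        value >>= 5
    return "".join(reversed(chars))


def new_ulid(timestamp_ms=None):
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return encode_ulid(timestamp_ms, secrets.randbits(80))


def ulid_timestamp_ms(ulid):
    """Decode the timestamp from the first 10 characters."""
    value = 0
    for ch in ulid[:10]:
        value = (value << 5) | CROCKFORD.index(ch)
    return value


ulid = new_ulid()
print(ulid, ulid_timestamp_ms(ulid))
```

Since the timestamp fills the leading characters and the alphabet is in ascending ASCII order, sorting ULIDs as strings sorts them by creation millisecond. Order within the same millisecond is random in this sketch.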

Trade-offs:

✅ Advantages:
- Time-sortable like Snowflake
- Case insensitive, URL-safe encoding
- No machine ID coordination needed
- 48-bit millisecond timestamp covers about 8,900 years (until roughly the year 10889)

❌ Disadvantages:
- Larger than Snowflake (128 bits vs 64 bits; 26 characters vs up to 19 decimal digits)
- Less throughput per millisecond
- Newer, community-maintained spec (less tooling support than UUID)

When to use: Need time-sortable IDs but can't manage machine ID coordination, want human-readable IDs.

Database Auto-Increment (with partitioning)

How it works: Use database sequences with different starting points and increments.

Example Setup:

Server 1: 1, 4, 7, 10, 13...  (start=1, increment=3)
Server 2: 2, 5, 8, 11, 14...  (start=2, increment=3)
Server 3: 3, 6, 9, 12, 15...  (start=3, increment=3)
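
The same start/increment scheme can be modeled in a few lines. This sketch uses in-memory counters in place of real database sequences; `partitioned_sequence` is a name chosen for illustration.

```python
from itertools import count, islice


def partitioned_sequence(server_index, total_servers):
    """IDs for one server: start at its index, step by the number of servers."""
    return count(start=server_index, step=total_servers)


TOTAL_SERVERS = 3
first_five = {
    server: list(islice(partitioned_sequence(server, TOTAL_SERVERS), 5))
    for server in range(1, TOTAL_SERVERS + 1)
}

for server, ids in first_five.items():
    print(f"Server {server}: {ids}")
```

The increment is baked into every sequence, which is why adding a fourth server later means reconfiguring all of the existing ones.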

Trade-offs:

✅ Advantages:
- Simple to implement
- Sequential within each server (globally only roughly ordered, since servers advance at different rates)
- Small ID size (8 bytes)
- Native database support

❌ Disadvantages:
- Database becomes bottleneck
- Hard to add/remove servers
- Reveals information about scale
- Single point of failure

When to use: Smaller systems, when you need guaranteed sequential IDs, simple architectures.

Choosing the Right ID Strategy

| Requirement         | Snowflake | UUID v4 | ULID | Auto-Increment |
| Time-sortable       | ✅        | ❌      | ✅   | ✅             |
| No coordination     | ❌*       | ✅      | ✅   | ❌             |
| High throughput     | ✅        | ✅      | ✅   | ❌             |
| Small size          | ✅        | ❌      | ❌   | ✅             |
| Human-readable      | ❌        | ❌      | ✅   | ✅             |
| No clock dependency | ❌        | ✅      | ❌   | ✅             |

*Snowflake needs machine ID coordination at setup, but no runtime coordination between servers

System Design Interview Decision Framework

Ask these questions to choose the right approach:

Do you need time-based sorting?
- Yes → Snowflake, ULID, or Auto-increment
- No → UUID v4
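How many servers will you have?
- Up to 1,024 servers → Snowflake works well (the 10-bit machine ID covers them)
- More than 1,024 servers → Consider ULID or UUID, or a Snowflake variant with a different bit allocation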
Can you coordinate machine IDs?
- Yes → Snowflake is a good choice
- No → ULID or UUID
Do you need IDs to be unguessable?
- Yes → UUID (fully random)
- No → Snowflake or ULID
Is this a read-heavy or write-heavy system?
- Write-heavy → Avoid UUID (random inserts cause B-tree page splits, 5-10x slower)
- Read-heavy → UUID is fine (insert performance doesn't matter much)

Real-world Examples:

Twitter: Uses Snowflake for tweets (need time-sorting, high scale)
Instagram: Uses modified Snowflake with different bit allocation
GitHub: Uses UUID for some resources, auto-increment for others
Stripe: Uses custom prefixed IDs (like cus_1234567890abcdef - not UUID, but their own format)

Understanding these trade-offs helps you make informed decisions in system design interviews and shows deeper architectural thinking.

When Standard Solutions Don't Fit: Custom ID Systems

Sometimes none of the standard approaches (Snowflake, UUID, ULID, Auto-increment) perfectly match your requirements. Many successful companies create custom ID formats tailored to their specific needs.

Stripe's Approach: Prefixed Random IDs

Format: {prefix}_{random_string}

Examples:

cus_1234567890abcdef    (Customer)
ch_3L4E5Y2eZvKYlo2C     (Charge)
sub_1A2B3C4D5E6F7G8H    (Subscription)
inv_1GjJKl2eZvKYlo2C    (Invoice)

Benefits:

Human-readable: Immediately know what type of object it is
URL-safe: Only uses characters that don't need encoding in URLs

What "URL-safe" means:

✅ Safe characters: a-z, A-Z, 0-9, hyphen (-), underscore (_), period (.), tilde (~)
❌ Unsafe characters: +, /, =, spaces, &, ?, #, %

Examples:
UUID:     f47ac10b-58cc-4372-a567-0e02b2c3d479  ✅ (hyphens are safe)
Base64:   SGVsbG8gV29ybGQ+                      ❌ (+ needs URL encoding, as would / or =)
Stripe:   cus_1234567890abcdef                  ✅ (only letters, numbers, underscore)

Why this matters:

Can use directly in URLs: api.stripe.com/customers/cus_123abc

No encoding needed: fetch('/api/orders/ord_456def')

Avoids bugs from forgotten URL encoding
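
One way to check whether an ID is URL-safe is to percent-encode it and see if anything changes. The sketch below uses `urllib.parse.quote` from the Python standard library; `is_url_safe` is a helper name invented for this example.

```python
from urllib.parse import quote


def is_url_safe(value):
    """True if the value survives percent-encoding unchanged.

    quote() never encodes letters, digits, or "_.-~"; passing safe=""
    makes it encode "/" as well.
    """
    return quote(value, safe="") == value


candidates = [
    "f47ac10b-58cc-4372-a567-0e02b2c3d479",  # UUID
    "SGVsbG8gV29ybGQ+",                      # Base64-style string with "+"
    "cus_1234567890abcdef",                  # prefixed ID
]

for candidate in candidates:
    print(f"{candidate!r}: safe={is_url_safe(candidate)}, encoded={quote(candidate, safe='')}")
```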

GitHub's Approach: Context-Aware IDs

Different ID strategies for different use cases:

Repository URL:   Human-readable name (user-facing: github.com/user/repo-name)
Issue ID:         Auto-increment per repo (issue #1, #2, #3...)
Commit SHA:       Git hash (distributed VCS requirement)

YouTube's Approach: Short Random IDs

Format: dQw4w9WgXcQ (11 characters, Base64-like encoding)

Benefits:

Short URLs: youtube.com/watch?v=dQw4w9WgXcQ
Non-sequential: Can't guess other videos by incrementing ID
Large ID space: 11 characters at 6 bits each can carry a 64-bit value
Shareable: Short enough to paste into text messages
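
An ID of the same shape can be produced by encoding 64 random bits with URL-safe Base64. This is only a sketch of the format; how YouTube actually generates its IDs is not covered here, and `new_short_id` is a hypothetical helper.

```python
import base64
import secrets


def new_short_id():
    """Return an 11-character URL-safe ID built from 64 random bits."""
    raw = secrets.token_bytes(8)  # 8 bytes = 64 bits
    # 8 bytes encode to 12 Base64 characters, the last being "=" padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


short_id = new_short_id()
print(short_id, len(short_id))
```

The URL-safe alphabet replaces `+` and `/` with `-` and `_`, so the result can go straight into a URL. With random IDs, a real system would still need a uniqueness check on insert.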

When to Build Custom ID Systems

Consider custom IDs when none of the standard approaches fit:

User-facing requirements: Need human-readable or branded formats
Multiple object types: Want to distinguish customers from orders from products
API design: Need stable, external IDs separate from internal database IDs
Regulatory requirements: Need audit trails or specific ID formats
Migration needs: Want to change internal storage without breaking external APIs

Custom ID Generator

Build your own ID system like Stripe - experiment with different prefixes, lengths, and character sets
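
As a static starting point, here is a minimal sketch of a prefixed random ID generator. It imitates the shape of the examples above and is not Stripe's actual algorithm; the alphabet, the default length, and the function names are assumptions of this sketch.

```python
import secrets
import string

# 62 URL-safe characters; the underscore is reserved as the prefix separator.
ALPHABET = string.ascii_letters + string.digits


def new_prefixed_id(prefix, length=16):
    """Return '<prefix>_<random body>' using a cryptographically secure source."""
    body = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{prefix}_{body}"


def object_type(prefixed_id):
    """Read the object type back from the prefix."""
    prefix, _, _ = prefixed_id.partition("_")
    return prefix


customer_id = new_prefixed_id("cus")
invoice_id = new_prefixed_id("inv", length=24)

print(customer_id, object_type(customer_id))
print(invoice_id, object_type(invoice_id))
```

Sixteen characters from a 62-character alphabet give roughly 95 bits of randomness, which makes guessing impractical. Uniqueness is still only probabilistic, so a unique constraint in the database remains a sensible backstop.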


Interview Talking Points for Custom IDs

When discussing custom ID systems in interviews:

Start with standard options: "I'd first evaluate Snowflake, UUID, and ULID..."
Explain why they don't fit: "But users need to share URLs, so UUID is too long"
Show trade-off thinking: "Custom means more code to maintain, but the better UX is worth it"
Discuss implementation: "I'd use cryptographically secure randomness to prevent enumeration attacks"
Mention maintenance: "We'd need to ensure uniqueness and handle edge cases ourselves"