Data Structures Behind Databases

Note that this section is optional and provides an in-depth exploration of the data structures used to build modern databases. You do not need this level of knowledge to solve system design questions. However, understanding these concepts will deepen your appreciation of various technologies and make it easier to read their documentation and to learn new ones.

Understanding the storage engine, the underlying technology that powers databases, may seem like an endeavor meant only for database developers. However, as a software engineer or system designer, you benefit significantly from knowing these intricacies, and they often give you a greater appreciation of the technology. This knowledge helps you with:

  1. Database Selection: A wide variety of databases are available in the market, each catering to different needs and use-cases. Understanding the fundamental principles of how these databases store, retrieve, and manage data can help you make an informed decision when selecting a database for your specific needs. For instance, if your application requires heavy write operations, you might opt for a database that uses Log-Structured Merge (LSM) trees, which are optimized for write operations.
  2. Performance Tuning: Databases offer a myriad of configuration options that can significantly impact their performance. Having a solid understanding of the underlying storage engine can guide you in tuning these settings for optimal performance. For instance, understanding the principles of B-Trees and how they affect read and write operations can help you optimize the page size or the fill factor of your database.

In the following sections, we will build a database from scratch (!) and cover the most important data structures powering your databases: the SSTable, the LSM tree, and the B-tree.

Sequential Writes vs Random Writes

First of all, I want to stress one key point in storage design: sequential writes are generally faster than random writes, for several reasons:

  1. On hard disk drives (HDDs), sequential writes minimize movement of the read/write head, reducing seek time and rotational latency.
  2. On solid-state drives (SSDs), sequential writes allow for more efficient wear leveling and are easier to optimize, because they do not require the frequent changes of write location that random writes do.
  3. File systems are optimized for sequential writes, allowing contiguous storage allocation and reducing fragmentation.
  4. Caching and buffering mechanisms are more effective with predictable, sequential write patterns, enabling faster data processing.
  5. Storage controllers are designed to handle sequential data flows more efficiently than random ones, further enhancing performance.

This performance gap between sequential and random writes has significant implications for database design.
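
To get an intuition for the gap, here is a rough micro-benchmark sketch in Python (not from the original text; the file name, block size, and block count are arbitrary assumptions, and actual numbers vary widely across hardware, file systems, and caching):

import os
import random
import time

FILE = 'bench.dat'   # hypothetical scratch file
BLOCK = 4096         # write in 4 KiB blocks
COUNT = 20_000       # number of blocks to write

def sequential_write():
    # Append blocks one after another.
    with open(FILE, 'wb') as f:
        for _ in range(COUNT):
            f.write(os.urandom(BLOCK))
        f.flush()
        os.fsync(f.fileno())

def random_write():
    # Overwrite blocks of the existing file at random offsets.
    with open(FILE, 'r+b') as f:
        for _ in range(COUNT):
            f.seek(random.randrange(COUNT) * BLOCK)
            f.write(os.urandom(BLOCK))
        f.flush()
        os.fsync(f.fileno())

start = time.perf_counter()
sequential_write()
print(f'sequential: {time.perf_counter() - start:.2f}s')

start = time.perf_counter()
random_write()
print(f'random:     {time.perf_counter() - start:.2f}s')

os.remove(FILE)

Note that the operating system's page cache can mask much of the difference for small data sets; the gap is most visible on spinning disks and with data larger than memory.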

Building a database from scratch

Simply speaking, a database stores data, either on disk or in memory. You can implement a simple key-value database in a few lines of bash:

#!/bin/bash

DB_DIR="$HOME/db"

# Check if the database directory exists, and if not, create it
if [ ! -d "$DB_DIR" ]; then
    mkdir "$DB_DIR"
fi

# Function to set a key-value pair
set_key() {
    key=$1
    value=$2
    # If the key already exists, this overwrites the value
    echo "$value" > "$DB_DIR/$key"
}

# Function to get the value for a key
get_value() {
    key=$1
    # Check if the key exists, and if not, return an error message
    if [ ! -f "$DB_DIR/$key" ]; then
        echo "Error: Key $key does not exist"
        return 1
    fi
    cat "$DB_DIR/$key"
}

# Function to delete a key-value pair
delete_key() {
    key=$1
    # Check if the key exists, and if not, return an error message
    if [ ! -f "$DB_DIR/$key" ]; then
        echo "Error: Key $key does not exist"
        return 1
    fi
    rm "$DB_DIR/$key"
}

# Dispatch: call the function named by the first argument, passing the rest
"$@"

This creates a directory called db under your home directory. For each key-value pair, it writes a file in that directory whose name is the key and whose content is the value. To look up a key, it reads the file whose name matches the key.

Example commands to use it:

# Set a key-value pair
./db.sh set_key key1 value1

# Get the value for a key
./db.sh get_value key1

# Delete a key-value pair
./db.sh delete_key key1

This is neither efficient nor elegant, but it offers a view of how a database fundamentally works: we serialize data onto a storage medium and retrieve it later. The devil is in the details of how to serialize it, where to store it, and how to read it back most efficiently.

SSTable: Sorted String Table

The bash database implementation we have so far is very slow to find data because it requires a sequential search. How do we make searching faster? The good old computer science concept of sorting comes to mind first.

An SSTable is a simple yet efficient data structure, particularly suitable for static indexes and massive, sorted datasets. It is essentially a file containing a set of arbitrary, sorted key-value string pairs. Because the keys are sorted, we can binary-search for the key we want in O(log n) time. To speed up access for large files, an optional key:offset index can be prepended to the file or maintained separately. We could even memory-map (mmap) the entire file for very fast reads.

[Figure: Sorted String Table]

Here's a basic implementation of an SSTable in Python:

import os
import bisect

class SSTable:
    def __init__(self, filename):
        self.filename = filename
        # In-memory index of (key, byte offset) pairs, kept sorted by key.
        self.keys = []
        if os.path.isfile(filename):
            with open(filename, 'r') as file:
                offset = 0
                for line in file:
                    k, v = line.strip().split(':', 1)
                    self.keys.append((k, offset))
                    offset += len(line)  # assumes ASCII, so characters == bytes
        self.keys.sort()

    def get(self, key):
        # Binary-search the in-memory index, then read the line at that offset.
        idx = bisect.bisect_left(self.keys, (key,))
        if idx < len(self.keys) and self.keys[idx][0] == key:
            with open(self.filename, 'r') as file:
                file.seek(self.keys[idx][1])
                k, v = file.readline().strip().split(':', 1)
                return v
        else:
            return None

    def put(self, key, value):
        # Append the pair to the end of the file and index its offset.
        with open(self.filename, 'a') as file:
            offset = file.tell()
            file.write(f'{key}:{value}\n')
            self.keys.append((key, offset))
            self.keys.sort()

# Usage
sstable = SSTable('sstable.txt')
sstable.put('key1', 'value1')
sstable.put('key2', 'value2')
print(sstable.get('key1'))  # 'value1'
print(sstable.get('key2'))  # 'value2'
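
As mentioned above, we could also memory-map the file for fast reads. Here is a minimal read-only sketch of that idea (MMapSSTable is a hypothetical class, not part of the original implementation; it assumes the same ASCII key:value line format as above and a non-empty file):

import bisect
import mmap

class MMapSSTable:
    def __init__(self, filename):
        self._file = open(filename, 'rb')
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # Build an in-memory (key, byte offset) index by scanning the file once.
        self.keys = []
        offset = 0
        for line in iter(self._mm.readline, b''):
            key = line.split(b':', 1)[0].decode()
            self.keys.append((key, offset))
            offset += len(line)
        self.keys.sort()

    def get(self, key):
        # Binary-search the index, then slice the mapped region directly
        # instead of issuing a read() on the hot path.
        idx = bisect.bisect_left(self.keys, (key,))
        if idx < len(self.keys) and self.keys[idx][0] == key:
            start = self.keys[idx][1]
            end = self._mm.find(b'\n', start)
            return self._mm[start:end].decode().split(':', 1)[1]
        return None

# Usage
reader = MMapSSTable('sstable.txt')
print(reader.get('key1'))  # 'value1'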

The problem with an SSTable is that random writes cannot be done efficiently. Once an SSTable is on disk, it is essentially immutable: a random write would require shifting everything after the write position by an offset, incurring a large I/O cost.

LSM Tree

What if we want both fast reads and fast writes? We could simply build the SSTable in memory first and flush it to disk once the data reaches a certain size. The in-memory SSTable is called a MemTable.

[Figure: LSM Tree]

Here’s an implementation:

import os
import uuid

class LSMT:
    def __init__(self, dir):
        self.dir = dir
        if not os.path.isdir(dir):
            os.makedirs(dir)
        self.memtable = {}
        self.sstables = [SSTable(os.path.join(dir, filename)) for filename in os.listdir(dir)]

    def get(self, key):
        # Check the in-memory MemTable first, then the on-disk SSTables
        # from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            value = sstable.get(key)
            if value is not None:
                return value
        return None

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) > 1000:  # Just a placeholder limit
            self.flush()

    def flush(self):
        # Write the MemTable out as a new on-disk SSTable, then clear it.
        filename = os.path.join(self.dir, str(uuid.uuid4()))
        sstable = SSTable(filename)
        for key, value in sorted(self.memtable.items()):
            sstable.put(key, value)
        self.sstables.append(sstable)
        self.memtable.clear()

# Usage
lsmt = LSMT('lsmt')
lsmt.put('key1', 'value1')
lsmt.put('key2', 'value2')
print(lsmt.get('key1'))  # 'value1'
print(lsmt.get('key2'))  # 'value2'

Marking Data as Obsolete using Tombstone

Note that our implementation so far does not support updates and deletes. Once an SSTable is written to disk, it becomes immutable, meaning that existing data can't be directly updated or deleted. Instead, if an update is needed, the newer value is simply stored in the MemTable. In the case of a deletion, a special record known as a "tombstone" is appended.

On a read, the system checks the MemTable first and then the SSTables from newest to oldest. Therefore, during subsequent reads, it encounters the updated value or the tombstone record before it reaches any older values. This effectively allows us to perform updates and deletions without directly modifying the SSTables on disk.

Merge SSTables

However, having hundreds of SSTables on disk isn't efficient. To address this, we periodically run a process that merges these on-disk SSTables. During this merge process, the updated values and tombstone records overwrite and remove the outdated data, maintaining the system's efficiency and data integrity.

Here’s an implementation of the “tombstone” and merging maintenance idea:

import os
import uuid

class SSTable:
    def __init__(self, filename):
        self.filename = filename
        # In-memory index: key -> (byte offset, is_live). is_live is False for tombstones.
        self.index = {}
        if os.path.isfile(filename):
            with open(filename, 'r') as f:
                offset = 0
                for line in f:
                    key, value = line.strip().split(',', 1)
                    self.index[key] = (offset, value != '<TOMBSTONE>')
                    offset += len(line)  # assumes ASCII, so characters == bytes

    def get(self, key):
        # Return the stored value, which may be the '<TOMBSTONE>' marker.
        # Return None only if the key is not present in this SSTable at all,
        # so the LSM tree can tell "deleted here" apart from "not here".
        if key in self.index:
            offset, _ = self.index[key]
            with open(self.filename, 'r') as f:
                f.seek(offset)
                _, value = f.readline().strip().split(',', 1)
                return value
        return None

    def put(self, key, value):
        with open(self.filename, 'a') as f:
            offset = f.tell()
            f.write(f'{key},{value}\n')
            self.index[key] = (offset, value != '<TOMBSTONE>')

class LSMT:
    def __init__(self, dir):
        self.dir = dir
        if not os.path.isdir(dir):
            os.makedirs(dir)
        self.memtable = {}
        # Note: os.listdir does not guarantee chronological order; a real system
        # would encode a sequence number in the SSTable file names.
        self.sstables = [SSTable(os.path.join(dir, filename)) for filename in os.listdir(dir)]

    def get(self, key):
        # Check the MemTable first, then the SSTables from newest to oldest.
        # A tombstone shadows any older value for the same key.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value == '<TOMBSTONE>' else value
        for sstable in reversed(self.sstables):
            value = sstable.get(key)
            if value is not None:
                return None if value == '<TOMBSTONE>' else value
        return None

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) > 1000:  # Just a placeholder limit
            self.flush()

    def delete(self, key):
        self.put(key, '<TOMBSTONE>')

    def flush(self):
        # Write the MemTable out as a new on-disk SSTable, then clear it.
        filename = os.path.join(self.dir, str(uuid.uuid4()))
        sstable = SSTable(filename)
        for key, value in sorted(self.memtable.items()):
            sstable.put(key, value)
        self.sstables.append(sstable)
        self.memtable.clear()

    def merge(self):
        # Compact all on-disk SSTables into one. Iterating from oldest to newest
        # means newer entries (including tombstones) win. Because everything is
        # merged into a single SSTable, tombstones can then be dropped entirely.
        latest = {}
        for sstable in self.sstables:
            for key, (offset, is_live) in sstable.index.items():
                latest[key] = sstable.get(key) if is_live else '<TOMBSTONE>'
        merged = SSTable(os.path.join(self.dir, str(uuid.uuid4())))
        for key, value in sorted(latest.items()):
            if value != '<TOMBSTONE>':
                merged.put(key, value)
        self.sstables = [merged]
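
A quick usage sketch for the classes above, showing an update, a delete, and a compaction (the keys, values, and directory name are just illustrative):

# Usage
lsmt = LSMT('lsmt_demo')
lsmt.put('key1', 'value1')
lsmt.put('key2', 'value2')
lsmt.put('key1', 'value1-updated')  # update: the newer value shadows the old one
lsmt.delete('key2')                 # delete: records a tombstone
lsmt.flush()                        # force the MemTable onto disk as an SSTable
print(lsmt.get('key1'))  # 'value1-updated'
print(lsmt.get('key2'))  # None
lsmt.merge()                        # compact all SSTables into one
print(lsmt.get('key1'))  # still 'value1-updated'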

This is close to the final form of the LSM tree, which powers many NoSQL databases such as Google Bigtable, Apache Cassandra, Amazon DynamoDB, and LevelDB.

B-tree

If you prefer video, here's a 2-minute video explaining the B-tree:

https://www.youtube.com/watch?v=8E4RRzMTzpM

A B-Tree (the "B" is often said to stand for "balanced") is a self-balancing, tree-based data structure that keeps data sorted and allows efficient insertion, deletion, and search operations. It's particularly useful for systems with large amounts of data stored on external storage such as hard drives.

A B-Tree is an extension of the binary search tree: instead of having at most 2 child nodes, a B-Tree node can have a variable number of children within a pre-defined range.

Here is how B-Tree works:

Properties:

  1. Order of the B-Tree: Every B-Tree has an order m, defined as the maximum number of children a node can have. A node in a B-Tree of order m can have at most m-1 keys and m children. An internal node (neither root nor leaf) must have at least ⌈m/2⌉ children.
  2. Root Node: If the tree is not empty, the root must have at least one key. The root is the only node allowed to have fewer than ⌈m/2⌉ children; it can have between 2 children (if it is not a leaf) and m children.
  3. Internal Nodes: These nodes can have between ⌈m/2⌉ and m children, or ⌈m/2⌉-1 and m-1 keys.
  4. Leaf Nodes: All leaf nodes appear at the same level and have no children.
  5. Keys: All keys within a node are sorted in increasing order. The key in the i-th position in a node is greater than all keys stored in its first i children, but less than the keys in its remaining children.

Operations:

  1. Search: The search operation on a B-Tree is similar to a binary search tree. Starting from the root node, we compare the keys in the node with the key we are searching for. If it matches any, we return. If not, we pick a child based on the key's value and search in that subtree.
  2. Insertion: The insertion operation ensures the B-Tree properties are maintained after inserting a new key. We start from the root and find the appropriate leaf where the new key should be inserted. If the leaf has fewer than m-1 keys, we insert it at the correct position. If the leaf is full, we split it into two leaves and move the middle key up to its parent. If the parent is also full, this splitting continues upwards until we reach a node that is not full (or a new root is created).
  3. Deletion: The deletion operation is a bit more complex, as it must ensure that the B-Tree properties are still maintained after deletion. This may involve borrowing an element from a neighboring sibling node or merging sibling nodes together.

Use-Cases:

B-Trees are widely used in filesystems and databases. They are particularly suited to systems where reads and writes are expensive, such as with a hard drive, because they minimize the number of disk reads and writes needed to locate data. Relational databases such as MySQL, PostgreSQL, and SQLite all use B-Trees (typically the B+Tree variant) in their storage engines and indexes.
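
To make the "minimize disk reads" point concrete, here is a rough back-of-the-envelope sketch (the branching factor and key count are arbitrary illustrative assumptions, not numbers from any particular database):

import math

# A node that fans out to ~500 children (plausible when a 4-16 KiB page holds
# hundreds of keys) indexing one billion keys.
branching_factor = 500
num_keys = 1_000_000_000

# The height grows with the logarithm of the key count, base = branching factor.
height = math.ceil(math.log(num_keys, branching_factor))
print(height)  # 4 -> a lookup touches only a handful of pages on disk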

Here’s a simplified implementation of insertion:

class BTreeNode:
    def __init__(self, leaf=False):
        self.leaf = leaf
        self.keys = []
        self.child = []

class BTree:
    def __init__(self, t):
        # t is the minimum degree: every node holds at most 2*t - 1 keys.
        self.root = BTreeNode(True)
        self.t = t

    # Insert a key
    def insert(self, k):
        root = self.root
        if len(root.keys) == (2 * self.t) - 1:
            # The root is full: grow the tree by one level, then split the old root.
            temp = BTreeNode()
            self.root = temp
            temp.child.insert(0, root)
            self.split_child(temp, 0)
            self.insert_non_full(temp, k)
        else:
            self.insert_non_full(root, k)

    # Insert into a node that is guaranteed not to be full
    def insert_non_full(self, x, k):
        i = len(x.keys) - 1
        if x.leaf:
            x.keys.append(None)  # placeholder slot; shifted keys fill it below
            while i >= 0 and k < x.keys[i]:
                x.keys[i + 1] = x.keys[i]
                i -= 1
            x.keys[i + 1] = k
        else:
            while i >= 0 and k < x.keys[i]:
                i -= 1
            i += 1
            if len(x.child[i].keys) == (2 * self.t) - 1:
                self.split_child(x, i)
                if k > x.keys[i]:
                    i += 1
            self.insert_non_full(x.child[i], k)

    # Split the full child x.child[i], promoting its middle key into x
    def split_child(self, x, i):
        t = self.t
        y = x.child[i]
        z = BTreeNode(y.leaf)
        x.child.insert(i + 1, z)
        x.keys.insert(i, y.keys[t - 1])
        z.keys = y.keys[t: (2 * t) - 1]
        y.keys = y.keys[0: t - 1]
        if not y.leaf:
            z.child = y.child[t: 2 * t]
            y.child = y.child[0: t]

Note that this is a very simplified implementation: it only covers insertion (with node splitting) and does not handle deletion or the borrowing and merging that deletion requires.
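
The search operation described earlier is not included in the code above. Here is a minimal sketch of what it could look like, as a standalone helper (a hypothetical function, not part of the original code, assuming the BTreeNode and BTree classes above):

def search(node, k):
    # Find the first key in this node that is >= k.
    i = 0
    while i < len(node.keys) and k > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == k:
        return node, i      # found: return the node and the key's position
    if node.leaf:
        return None         # reached a leaf without finding the key
    return search(node.child[i], k)

# Usage
btree = BTree(t=2)
for key in [10, 20, 5, 6, 12, 30, 7, 17]:
    btree.insert(key)
print(search(btree.root, 6))    # (<BTreeNode object at ...>, 1)
print(search(btree.root, 15))   # None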

LSM-tree vs B-tree and the Importance of Sequential Writes

LSM Trees are usually faster for writing data: like sticking notes on a board, you simply append new entries (to the MemTable, which is later flushed to disk sequentially) without organizing them right away. B-Trees are often faster for reading data: since everything is already organized, like a filing cabinet, finding what you need is quicker.

In a nutshell, if you're doing a lot of writes and relatively few reads, LSM Trees tend to be more efficient. But if you're often reading data, B-Trees can be faster due to their organized structure. This difference is crucial in databases, where the speed of reading and writing data can significantly impact overall performance.

