I Built a B-Tree in Pure Python and Finally Understood Why Postgres Uses It for Every Index
I have used Postgres indexes for years. I ran EXPLAIN ANALYZE, watched seq scans drop to index scans, and called it good. But I never actually understood what was happening under the hood.
Then I had one of those 2am incidents. A query that touched 4 million rows slowed from 80ms to 14 seconds after an insert spike. The DBA said "B-tree bloat." I nodded like I understood. I did not.
So I did what I always do when I do not understand something. I built it from scratch in pure Python.
No dependencies. No shortcuts. Just the data structure itself.
Here is what I learned.
Why B-Trees, Not Binary Trees?
Most of us learn binary search trees in school. Each node holds one value. Left child is smaller, right child is larger. Searching is O(log n). Clean.
But databases hate binary trees. Here is why.
A binary tree with 1 million nodes has a height of about 20. That means 20 disk reads to find one row. Each disk read is roughly 1ms on a spinning disk (or 100 microseconds on SSD). That adds up fast.
B-trees solve this by being "wide not tall." Each node holds multiple keys and has multiple children. A B-tree of order 100 (100 keys per node) with 1 million entries has a height of only 3. Three disk reads. That is why Postgres uses it.
The rule: every internal node holds between t-1 and 2t-1 keys, where t is the minimum degree. Leaf nodes have the same constraint. This keeps the tree balanced automatically.
The Data Structure
class BTreeNode:
def __init__(self, leaf=False):
self.keys = [] # sorted list of keys
self.children = [] # child node references (empty if leaf)
self.leaf = leaf
self.values = {} # key -> value mapping (for leaf nodes)
class BTree:
def __init__(self, t=3):
self.root = BTreeNode(leaf=True)
self.t = t # minimum degree
The minimum degree t controls how wide the tree is. With t=3, each node holds 2 to 5 keys. With t=100 (like Postgres uses internally), each node holds 99 to 199 keys. Bigger t means shorter trees. Shorter trees mean fewer disk reads.
Searching
Search is the simplest operation. Walk down the tree, at each node binary-search the keys, go left or right depending on where the target falls.
def search(self, node, key):
# find the first key greater than or equal to our target
i = 0
while i < len(node.keys) and key > node.keys[i]:
i += 1
# found exact match
if i < len(node.keys) and key == node.keys[i]:
if node.leaf:
return node.values.get(key)
# internal nodes do not store values in a real DB
# they just guide the search
return self.search(node.children[i + 1], key)
# not found at this node, go to appropriate child
if node.leaf:
return None # key does not exist
return self.search(node.children[i], key)
def get(self, key):
return self.search(self.root, key)
Notice the logic: at each node, we scan the keys list to find where our target key fits, then either return the value (if found at a leaf) or recurse into the right child. Height is O(log_t n), so the number of recursive calls is tiny even for millions of rows.
Insertion and Node Splitting
This is where B-trees get interesting. Inserting into a full node would violate the max-keys constraint. The fix: split the node before inserting.
The split operation takes a full child node, cuts it in half, and promotes the median key up to the parent. This is why B-trees stay balanced without any rotation logic (unlike AVL trees or red-black trees).
def split_child(self, parent, i):
"""Split parent.children[i] and promote median key to parent."""
t = self.t
full_child = parent.children[i]
new_child = BTreeNode(leaf=full_child.leaf)
# median key index
mid = t - 1
# promote median key to parent
parent.keys.insert(i, full_child.keys[mid])
parent.children.insert(i + 1, new_child)
# right half goes to new_child
new_child.keys = full_child.keys[mid + 1:]
full_child.keys = full_child.keys[:mid]
# split children pointers if internal node
if not full_child.leaf:
new_child.children = full_child.children[t:]
full_child.children = full_child.children[:t]
# split values if leaf node
if full_child.leaf:
for k in new_child.keys:
new_child.values[k] = full_child.values.pop(k)
Now insertion uses a top-down approach. As we walk down to find the insertion point, we split any full node we encounter. This guarantees that when we reach a leaf, it always has room.
def insert_non_full(self, node, key, value):
i = len(node.keys) - 1
if node.leaf:
# insert key in sorted position
node.keys.append(None)
while i >= 0 and key < node.keys[i]:
node.keys[i + 1] = node.keys[i]
i -= 1
node.keys[i + 1] = key
node.values[key] = value
else:
# find child to recurse into
while i >= 0 and key < node.keys[i]:
i -= 1
i += 1
# split child if full
if len(node.children[i].keys) == 2 * self.t - 1:
self.split_child(node, i)
if key > node.keys[i]:
i += 1
self.insert_non_full(node.children[i], key, value)
def insert(self, key, value):
root = self.root
if len(root.keys) == 2 * self.t - 1:
# root is full, need to grow the tree
new_root = BTreeNode(leaf=False)
new_root.children.append(self.root)
self.split_child(new_root, 0)
self.root = new_root
self.insert_non_full(new_root, key, value)
else:
self.insert_non_full(root, key, value)
The root split is the only time tree height increases. That single growth path is what keeps a B-tree perfectly balanced at all times.
Range Queries
Databases use B-trees for range queries too. WHERE created_at BETWEEN '2024-01-01' AND '2024-03-31' is a B-tree range scan.
def range_query(self, node, low, high, results):
"""Collect all key-value pairs where low <= key <= high."""
i = 0
while i < len(node.keys) and node.keys[i] < low:
i += 1
while i < len(node.keys) and node.keys[i] <= high:
if node.leaf:
results.append((node.keys[i], node.values[node.keys[i]]))
else:
# recurse into left child first
self.range_query(node.children[i], low, high, results)
if node.keys[i] <= high:
results.append((node.keys[i], node.keys[i])) # internal key
i += 1
# recurse into rightmost relevant child
if not node.leaf and i < len(node.children):
self.range_query(node.children[i], low, high, results)
def get_range(self, low, high):
results = []
self.range_query(self.root, low, high, results)
return sorted(results)
Notice we skip subtrees entirely when their key range does not overlap our query. That is the "index skip scan" behavior you see in EXPLAIN ANALYZE output.
Putting It Together: A Quick Demo
bt = BTree(t=3)
# insert some user IDs with usernames
for uid, name in [
(1, "alice"), (5, "bob"), (10, "carol"),
(3, "dave"), (7, "eve"), (12, "frank"),
(2, "grace"), (8, "heidi"), (15, "ivan"),
(4, "judy"), (6, "kate"), (11, "leo"),
]:
bt.insert(uid, name)
# point lookup
print(bt.get(7)) # eve
# range query
print(bt.get_range(3, 10))
# [(3, 'dave'), (4, 'judy'), (5, 'bob'), (6, 'kate'),
# (7, 'eve'), (8, 'heidi'), (10, 'carol')]
Both operations are fast regardless of total size, because the tree height stays low.
What This Taught Me About Real Postgres Behavior
Once I understood the structure, several things clicked.
B-tree bloat happens when many deletes leave half-empty pages. Postgres marks those pages as "dead" but does not immediately reuse them. The tree gets taller and pages get fragmented. VACUUM and REINDEX compact those pages back down. That 2am incident made total sense now.
Why composite indexes have column order rules. A composite index on (city, created_at) is a B-tree sorted first by city, then by created_at within each city. A query on WHERE city = 'Nairobi' can use the index. A query on WHERE created_at > '2024-01-01' without filtering by city cannot, because the tree is not sorted by created_at at the root level.
Why LIKE 'nairobi%' uses an index but LIKE '%nairobi' does not. Prefix matching aligns with how the B-tree is sorted. Suffix matching would require scanning every leaf.
The fill factor setting. Postgres lets you set a fill factor per index (default 90%). This leaves 10% of each page empty during initial build, reserving room for inserts. If your table gets heavy insert traffic, a lower fill factor reduces splits. If it is mostly read-heavy, keep it at 90%.
Build the Toy Version of What You Index
Before this exercise, I treated Postgres indexes as a black box. Point it at the right columns, check EXPLAIN, move on.
Now I understand why the column order in a composite index matters, why VACUUM affects query performance, and why a B-tree beats a hash index for range queries but loses for equality-only lookups.
The toy did not replace Postgres knowledge. It made everything else make sense faster.
Next time your EXPLAIN plan shows an index scan you do not understand, ask what the B-tree is doing at that step. The answer is usually "skipping subtrees you do not need." That is the whole trick.
If you found this useful, follow me here on dev.to. I share one of these "I built X to understand Y" posts every week. Next up: a mini query planner that chooses between seq scan and index scan based on table statistics.
What is the weirdest index behavior you have had to debug? Drop it in the comments.











