Front matter and motivation

The Key Problem

Hint: it’s not storing data (hardware does a fine job of that already). Software just needs to funnel data and not ‘get in the way’ of hardware

The key challenge is efficiently finding data - a client provides an identifier, the system finds the data for that identifier and returns it. This must be efficient.

In other words, storage is actually a search problem. The write algorithm aims to (efficiently) set up for reads (to also be efficient)

Think of all storage systems as fundamentally key/value mappings. All storage systems have keys, but sometimes the key is heavily disguised

  • Want to load a web page? You need to provide its URL
  • Want to read a file? You need to provide the file’s path
  • When to read from memory? You need to provide the memory’s address
  • Want to read a row from a database? You need the row’s key
  • And so on

All storage systems need keys because you can’t deal with all the data all the time. Keys allow you to amplify knowledge: if have a small piece of information (a key) you can obtain a larger piece of information (a record). The record might have additional keys you can use to learn more.

With this design, a storage system is just a key/value mapping

Optimizing and Tradeoffs

Optimization: you have complete controls of your data structures and algorithms, none over the client workload (the pattern of read and write requests generated by a client). Many strategies for optimizing: it’s hard, tradeoffs can be subtle!

Metrics you want to optimize for: latency (more properly, response time) measured in milliseconds (1/1000th) or microseconds (1 millionth); throughput, measured either in operatios per second or bytes per second. Maybe mention OLTP (latency/IOPS focused) vs OLAP (bandwidth focuoed)

Data Strutures for Storage

What we’re trying to do is built a key-value map data structure stored on disks. What are some mapping data structures we already know?

Hash Tables

Binary Search Trees

Maps for Disks

Disks provide the same basic interface to the computer as memory does: an array of bytes:

Diagram

So the data structures we discussed above, which are typically built in memory, you can also store on a disk.

There’s a catch though: disks might have the same interface as memory, but they perform quite differently!

Disk Performance

The different parts of your computer don’t run at the same speed.

Copy table from, and link to https://www.prowesscorp.com/computer-latency-at-a-human-scale/

Caching

We see that RAM is a lot faster than disk, so maybe keep data in RAM instead of disk so we can retrieve it quickly. But there’s a lot less space in RAM than there is on disk, so we have to cache selectively, guessing what will be useful (what will be read soon).

Cache heuristics: “temporal locality:” workload likely to revisit data it used before. “spatial locality:” workload likely to iterate over keys in ascending or descending order.

Hash Tables and Spatial Locality

Since we want to cache data, hash tables aren’t a good fit for our storage system. They purposefully scramble data in an attempt to minimize hash collisions:

Diagram: A B C D E -> 3 1 5 4 2 -> C A E D B

Binary trees keep data sorted in key order, maintaining a degree of spatial locality, making them closer to want we want out of a key-value mapping structure

Another Observation

For small I/O sizes, the time it takes to read from a disk is roughly constant

Find a way to show the graphs we made in excel here

Conclusion: a search data structure for disks should minimize the number of disk operations, even if that means each individual read operation reads more data than is strictly needed. This is true as long as the keys and values are small (<64KB maybe, per the graph above)

Binary Trees and Disk Accesses

Binary trees have good asymptotic complexity, but the absolute number of disk accesses is very high:

(20ms per read) x (25 reads per lookup) x (15 lookups) = 7.5 seconds (!)

Imagine that was what it took to read just one of the posts in your Facebook/Twitter/Instagram/etc feed … are you willing to wait minutes for your feed to load?

B-Trees

Why not take a binary search tree, and ‘smoosh’ it down by packing in many key/value pairs into a single node?

Diagram

We choose a node size around that sweet spot in the graph above (typically somewhere between 8K and 64K) and pack as many items into each node as will fit. We still use a binary search, but there are fewer individual nodes to read in, because each node has many values.

To search, we read in the root node, binary-search it (as an array), and follow the child link to another node, which we then binary-search (as an array), recurisvely, until we either find the key being queried or an empty space where the key would be if we had it:

Diagram

Constructing a b-tree is a little more involved, but certainly nothing we can’t handle Wikipedia has a good writeup (link)

Nice Properties of B-Trees

By packing many key-value records into a single node, we greatly reduce the number of individual nodes needed to store the tree. This allows us to store large amounts of data with a reasonable number of disk accesses.

B-tree nodes are also cache friendly. A simple scheme is, every time you read in a node, you add it to an in-memory cache. This is simple, and gives you both spatial locality (because keys in a node are a sorted and adjacent) and temporal locality (becuase you’re caching nodes that were read recently).

The Ubiquitous B-Tree

B-trees are a really good data structure for storage systems, and they’re used for basically everything: databases, file systems, caches and more.

Many variants designed with different user cases in mind. Link to some

Let’s see how the interfaces and performance characteristics of many classes of storage systems fall out of b-trees:

Let’s Build a Database

Overview showing what a database table looks like, with a diagram

We can decompose a row into a key-value mapping by examininng the table schema

Diagram

A table, then, is a sorted array of this key-value mapping.

So then the obvious approach is to store the database as a b-tree of rows.

Diagram showing the example table on one side, the same rows in a b-tree on the other side

Each record is a row, which can be decomposed into both a key and a value as described above.

Let’s look at how common queries work

  • Insert - show how this is a row insert
  • Select - with a single key
  • Select - with a range of keys
  • Delete -

To complete your database now all you need is a transaction manager (which locks ranges of keys that are being operated on, to provide isolation) and a query parser (giving users an interface to your database)

Huzzah! We solved storage systems … or did we? How do you handle data that’s larger than what can fit in a single b-tree node?

This is the purview of a different, but closely related kind of storage system . . .

File Systems

Databases vs file systems: similar in principle, differ in target workloads. Databases optimize for large numbers of relatively small ‘records’ (rows), whereas file systems optimize for smaller numbers of relatively larger ‘files.’ Plenty of overlap.

Although file systems are not tied to b-trees as closely as databases usually are, many modern file systems use b-trees ubiquitously (such as btrfs on Linux, ReFS on Windows?, AFS on macOS? Others worth mentinoinng?)

Handling Large Items

How does a file system handle these large items? You can’t stick them in your b-tree because they’re often much larger than any particular node of your b-tree.

Think about the interface a disk provides, and fundamentally how a b-tree is laid out on disk

Diagram showing a byte array, b-tree nodes in that byte array pointing to each other

Idea: drop data anywhere on disk you have room to put it, then use a b-tree record to point to the data:

Diagram

Naming Hierarchy

Very quick overview of files and directories. Show how a b-tree is a good way to handle this

File Contents

Show how to use a b-tree to index over a file’s contents, which are stored as blocks elsewhere on disk

Disk Allocation

Explore the problem of allocating space on a disk, both for allocating blocks of space for file contents and for allocating blocks of space for b-tree nodes. Note that a b-tree can even be used for this.

Mention fragmentation, this is more than just a data structures problem, very similar to writing a memory allocator since disks are very similar to memory

Where To Next

Other concerns like resiliency (checksums), concurrency control

New unconventional types of storage devices which meld performance aspects of disks and memory

Append-only stores and the unique challenges they pose

Distributed stores and the unique challenges they pose

Additional Reading

Plug DDIA