Log-Structured Merge on Flash

March 2023

This post is a draft. Content may be incomplete or missing.

During my time in Azure, I spent a lot of time thinking about log-structured merge trees. I also learned quite a bit about what goes on inside a flash SSD. Today, I wanted to write down a few thoughts about designing LSM Trees to run on SSDs.

By conventional wisdom, that’s not something you’d normally do. LSM Trees were designed primarily to address seek latencies on traditional rotating hard disks, and there is no such thing on flash; a priori, then, there isn’t a strong argument for using an LSM Tree to manage an index stored on flash. However, I think LSM Trees have some interesting properties worth considering, should you ever find yourself optimizing a storage system for flash. In particular, LSM Trees are famous for hard-to-tackle tail latency problems, and I think those are generally either non-issues or easy to solve on flash.

So we’re all on the same page, let’s start with a quick review of what LSM Trees are and why they exist. This will give us better context to talk about whether LSM Trees even make sense for SSDs in the first place, let alone how we can tune an LSM to work well on an SSD:

Before Log-Structured Merge, We Had B-Trees and ARIES

(If you already know exactly what I mean by “b-tree index with ARIES write-ahead logging”, feel free to skim this section.)

Before we had flash SSDs, we had rotating magnetic hard disks. Log-structured merge trees were invented during this era, with the design goal of achieving better throughput on a rotational hard disk than the previous state of the art indexing structures. For the most part, the ‘previous state of the art’ involved various configurations combining b-tree indexes and ARIES-like write-ahead logs / journals.

For any storage device, old or new, you can break down the time cost of an I/O into a fixed amount of “setup” time, which is the overhead of getting the I/O going and later cleaning it up, and the “transfer” time of actually moving data between the disk and memory (disk to memory on read, memory to disk on write). In practice, transfer time dominates for large I/Os, but setup time dominates for small I/Os:

[ Abstract stacked line graph showing fixed setup cost and proportional transfer time cost. Circle the ‘small I/O’ area ]

One way of reading this graph is to say that, up to a certain I/O size (4KB-16KB in this example), the time cost of a read or write is approximately constant. So, if you need to read a few bytes from this disk, you may as well go read the surrounding 4-16K block too, and cache the whole thing in memory. After all, it costs you nothing extra to read the additional data, and if you cache it in memory, maybe you can serve a future read without a trip to disk at all! This is what b-trees try to exploit.
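To make that concrete, here’s a quick back-of-envelope model. The setup time and bandwidth numbers are illustrative assumptions, not measurements of any particular disk:

```python
def io_time_ms(size_bytes, setup_ms=5.0, bandwidth_mb_s=100.0):
    """Rough cost model: a fixed setup overhead plus proportional transfer time."""
    transfer_ms = size_bytes / (bandwidth_mb_s * 1e6) * 1e3
    return setup_ms + transfer_ms

# With these made-up numbers, reading 64 bytes costs ~5.0 ms and reading a
# whole 16 KB page costs ~5.2 ms -- nearly the same, so you may as well grab
# the whole page and cache it. Only at megabyte sizes does transfer dominate.
for size in (64, 4 * 1024, 16 * 1024, 1024 * 1024):
    print(f"{size:>8} bytes: {io_time_ms(size):.2f} ms")
```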

There are many ways to think about b-trees; usually, people introduce them as a generalization that extends binary search trees to $n$-ary trees, where each node branches in up to $n$ different child directions. However, I usually prefer to think in terms of b+ trees, which you can picture as chunked arrays. Imagine I gave you a set of key-value pairs; you could store these by just writing them to disk in sorted order.

[ diagram ]

Then, to exploit the observation above about setup time vs transfer time, we’ll pick a chunk size of about 4-16K and split the array into chunks of that size. To use standard parlance, we’ll call each chunk a “page.”

[ diagram, make sure to label “page” ]

To simplify our allocation logic, we’ll allow these pages to be scattered across the disk in arbitrary locations. But that means we need a secondary structure to track the disk offset of each page. A simple way to do this is to use another array. This second array contains directors, each of which holds two things: a copy of the first key in some page, and the disk offset where that page is stored:

[ diagram ]

However, this secondary array could also be quite large, so we’ll want to chunk it into pages too:

[ diagram ]

To track these pages, we need another array of directors:

[ diagram ]

But, of course, that may need to be chunked too. You keep going like this, recursively, until you have a ‘root’ directory array that fits in a single page. The resulting recursive structure of chunked arrays is a b+ tree, which for almost all intents and purposes is the same as Bayer and McCreight’s original b-tree construct.

To search this thing for a particular key, all you need to do is start at the root, search its directors for the gap where your key ‘should’ appear, and follow that director to its page. Eventually you land at a leaf page, which is just an array of key-value pairs you can search directly. Each page (director or leaf) is an array, so you can binary-search it if the keys have a fixed size, or scan left-to-right otherwise:

[ diagram ]
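Here’s a minimal sketch of that lookup in Python. The `Directory`/`Leaf` classes and in-memory layout are my own simplifications for illustration, not a real on-disk page format:

```python
import bisect

class Leaf:
    """A leaf page: a sorted run of key-value pairs."""
    def __init__(self, items):                  # items: sorted (key, value) pairs
        self.keys = [k for k, _ in items]
        self.values = [v for _, v in items]

class Directory:
    """A director page: for each child, its first key and a 'pointer' to it."""
    def __init__(self, entries):                # entries: sorted (first_key, child) pairs
        self.keys = [k for k, _ in entries]
        self.children = [c for _, c in entries]

def search(page, key):
    while isinstance(page, Directory):
        # Find the gap where the key 'should' appear: the rightmost director
        # whose first key is <= our key, then follow it down a level.
        i = max(bisect.bisect_right(page.keys, key) - 1, 0)
        page = page.children[i]
    # We've landed at a leaf page: binary-search its sorted key array.
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# A tiny two-level tree: one root directory over two leaf pages.
left = Leaf([("apple", 1), ("banana", 2)])
right = Leaf([("mango", 3), ("pear", 4)])
root = Directory([("apple", left), ("mango", right)])
print(search(root, "pear"))   # -> 4
print(search(root, "kiwi"))   # -> None (not in the tree)
```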

This structure happens to be extremely cache-friendly. Since reading a whole page costs essentially no more than reading a single director or key-value pair, populating this cache is essentially free. Due to the high branching factor of the tree, the internal subtree of directors is orders of magnitude smaller than the leaf level, so usually the entire internal subtree fits neatly in the cache and tends to stay there even with naive LRU eviction. Only a subset of the leaf pages can be cached at any one time, but since the structure is sorted, all the keys in a single leaf page are ‘near’ one another, so you get good spatial and temporal locality just by caching recently used leaf pages.

A b-tree itself is enough to build yourself a simple key-value store, but it’s not quite enough to implement something more advanced, like a SQL engine with ACID transactions. The problem has to do with crash recovery: if you have a single transaction that’s supposed to update two key-value pairs in two different b-tree pages, what do you do if the machine crashes after only one of the two pages has been updated on disk? This is what ARIES addresses with its write-ahead logging strategy.

The ARIES write-ahead log is a separate data structure on disk that stores information about updates accepted by the system. With a write-ahead log, updates to the main b-tree can be lazy:

  1. The system initially writes the update to the write-ahead log (WAL)
  2. At this point, the system replies to the client, acknowledging that the write completed
  3. Later, the update is persisted to a b-tree page, and the WAL record is deleted

If we crash before writing the update to the WAL, it’s no-harm no-foul: everyone agrees the transaction never happened. If we crash after the b-tree is updated, we’re also good: everyone agrees the transaction did happen. However, if we recover the log and find updates that are in the log but not the b-tree, then we need to update the b-tree too before we move on, as the client may have already been told the transaction ‘happened.’ The requirement to replay events found in a write-ahead log is called redo semantics.
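Here’s a bare-bones sketch of that write path and the redo step. This isn’t ARIES (no transactions, no log sequence numbers); the file name and the dict standing in for the b-tree are placeholders for illustration:

```python
import json, os

WAL_PATH = "wal.log"   # placeholder file name for this sketch
btree = {}             # stand-in for the on-disk b-tree

def client_write(key, value):
    # 1. Make the update durable in the write-ahead log first.
    with open(WAL_PATH, "a") as wal:
        wal.write(json.dumps({"key": key, "value": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())
    # 2. Acknowledge to the client now; the b-tree catches up later.
    return "ok"

def replay_log():
    # Redo semantics: anything found in the log must be applied to the
    # b-tree, because the client may already have been told it 'happened.'
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH) as wal:
            for line in wal:
                rec = json.loads(line)
                btree[rec["key"]] = rec["value"]

def checkpoint():
    # 3. Lazily apply the logged updates to the b-tree, then drop the log.
    replay_log()
    if os.path.exists(WAL_PATH):
        os.remove(WAL_PATH)

# On startup after a crash, recovery is just replaying the surviving log.
replay_log()
```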

[ diagram of the whole system together here ]

The real ARIES protocol extends this to support multi-step transactions. If all the steps of a transaction make it out to the WAL, then the transaction becomes ‘permanent’ and must be redone as described above. However, if only some of the steps made it, the entire transaction is ‘undone’ and none of the partial updates ever make it out to the b-tree. This is sufficient to make transactions atomic (in the ACID, “all or nothing” sense) even if the transaction touches many parts of the system.

The resulting setup, an ARIES write-ahead log sitting in front of a b-tree written in place on disk, was the de facto state of the art in databases throughout most of the 1990s and partway into the 2000s. Although both of these structures are database-specific, file systems developed analogous ones along similar lines: journals instead of WALs, and inode trees instead of b-trees. In fact, modern file systems have started to close the loop; btrfs is based on b-trees and has a WAL-like journal, for example.

Performance Characteristics of a B-Tree/ARIES Strategy

Say you had one of these b-tree/ARIES systems. What would the performance characteristics be? When you push this system to its limits, which part falls over first? For now, let’s focus on writes:

The system’s write latency largely comes down to the write-ahead log, since we acknowledge writes as successful back to the client as soon as the write is durable in the log. The client never sees the latency of updating the corresponding b-tree page(s), so clients don’t care how fast we can update b-tree pages.

However, the system’s write throughput is almost certain to be bottlenecked on b-tree page updates. On a rotating hard disk, the time it takes to move the needle to a different disk location is huge compared to the time it takes to transfer data once the needle is positioned. The write-ahead log, being a small, sequential structure, minimizes seeking: once the needle is positioned over the write-ahead log, appending more data to it is fast. But the b-tree is huge, and updates may be scattered across many pages. In the worst case, each update in the log touches a different b-tree page, meaning you need to seek once per update.

The takeaway, then, is that the write throughput of this system is limited by how fast the disk can seek, not by how fast it can write data! Ingesting updates into the write-ahead log might be fast and efficient, but that log can’t grow forever, and offloading log entries into b-tree pages makes the disk needle thrash back and forth. At the limit, you can only update pages as fast as the disk can seek from page to page, and you can only accept client writes as fast as you can update pages.
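Some rough numbers make the gap obvious. These are ballpark figures I’m assuming for a generic hard disk of that era, not benchmarks:

```python
seek_ms = 10.0        # assumed average seek + rotational delay
page_kb = 8
seq_mb_s = 150.0      # assumed sustained sequential bandwidth

# If every logged update dirties a different b-tree page, the disk is
# seek-bound: roughly 1000 / seek_ms page updates per second.
random_updates_per_s = 1000 / seek_ms                 # ~100 updates/s
# The same disk streaming sequentially could write thousands of page-sized
# chunks per second -- if only we didn't have to seek between them.
sequential_pages_per_s = seq_mb_s * 1024 / page_kb    # ~19,200 pages/s

print(f"seek-bound: ~{random_updates_per_s:.0f}/s, "
      f"sequential: ~{sequential_pages_per_s:.0f}/s")
```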

This is where O’Neil et al. came in with their log-structured merge design. The question is: can we improve throughput by not seeking so aggressively?

In Comes Log-Structured Merge

Now we know the goal: get closer to the hard disk’s sequential write throughput. We do this with an amortization strategy: collect many updates, and then rewrite the entire b-tree (or a large portion of it). That might be a lot of work, yes, but as long as you amortize enough updates over each rewrite, the actual overhead per client write can be low, and you can write the whole new tree in one big sustained I/O. Seek once, write hundreds of megabytes to gigabytes of tree! This is the log-structured part.

Of course, you have to do something with those pending updates until you have enough to justify a big rewrite. You do this by maintaining ‘layers’ of the tree that overlay one another. First you log the writes and build an in-memory overlay tree; then you checkpoint it to disk, creating a b-tree. New writes go into a fresh memory table, overlaid on top of the on-disk b-tree. Keep going and eventually you have two b-trees, so now there are three overlay layers: memory, the new b-tree, and the old b-tree. Eventually the b-trees have to be merged. This is the merge part.
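Here’s a toy version of the overlay idea, with plain in-memory dicts standing in for the on-disk trees. It’s only meant to show the shape of the thing (no write-ahead log, no real I/O), not how a production LSM lays out its runs:

```python
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # newest overlay, absorbs all writes
        self.runs = []                # older overlays: sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # (A real system would write-ahead log here before acknowledging.)
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # "Log-structured": write the whole overlay out as one sorted run.
        self.runs.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        # Reads check every overlay, newest to oldest (read amplification).
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            if key in run:
                return run[key]
        return None

    def merge(self):
        # "Merge": collapse the overlays back into a single sorted run.
        merged = {}
        for run in reversed(self.runs):   # oldest first, so newer values win
            merged.update(run)
        self.runs = [dict(sorted(merged.items()))]
```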

Amortization and overlays are what make it ‘log-structured’ and ‘merge.’ Viewed more generally, an LSM Tree isn’t really a data structure so much as a design pattern for working with other data structures. For example, you could build a “log-structured merge” hash table that uses the same overlay-until-amortized-rewrite trick to rewrite a big hash table on disk.

What does this cost us? Obviously, the extra overlays cost additional disk space, and since you have to read through all the overlays on every client request, read amplification. But at least writes are fast. Right?

Tail Latency in Log-Structured Merge

It turns out there’s a nonobvious problem with write latency in an LSM. Usually write latency is pretty good, just like with b-trees/ARIES, because you’re still write-ahead logging. Unfortunately, the background work of rewriting the tree on disk is extremely disruptive in a way that shows up as higher tails in your clients’ write latency distribution.

Consider what happens when you tweak your policy so that the average ‘run’ written to disk is twice as large. From a throughput perspective, this is an improvement, because now you’re writing out the same data with half as many seeks, and seeks were the really slow thing on your hard disk. However, while you’re writing a run, the disk can’t do anything else (lest you reintroduce the very seeks you were trying to avoid).

Assume for simplicity that the client’s writes are uniformly distributed in time. Then the number of requests that arrive while the disk is ‘hung up’ writing out a big run of tree data is proportional to how long the disk is hung up. If you make a run twice as large, twice as many requests get caught up waiting on the disk, and each request is waiting twice as long!
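A quick back-of-envelope calculation shows how fast this blows up. The arrival rate and run-write times here are made-up numbers, purely for illustration:

```python
def total_stall(run_write_s, arrivals_per_s):
    """Total client waiting accumulated while the disk is busy writing one run,
    assuming uniform arrivals and nothing else served during the run."""
    caught = arrivals_per_s * run_write_s   # requests that arrive mid-run
    avg_wait_s = run_write_s / 2            # each waits half a run on average
    return caught * avg_wait_s              # grows with the *square* of run time

print(total_stall(0.5, 1000))   # 0.5 s run -> 125 request-seconds of waiting
print(total_stall(1.0, 1000))   # 1.0 s run -> 500 request-seconds: 4x, not 2x
```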

You’re left in one of those unfortunate “rock and a hard place” situations. You want to make runs longer to get better throughput, but doing so causes a superlinear increase in your latency tail. Trying to balance the two can leave you with a “jack of all trades, master of none” system that has middling throughput and undesirably high latency tails …

… on hard disks.

Flash Storage is Built Different

TODO a little history, the introduction of NAND flash

NAND flash isn’t very impressive. It’s slow (think of the old USB thumb drives you may have used), using it tends to break it (erase cycles, read disturb), not using it also tends to break it (link to articles), and for economic reasons you basically don’t get overwrite; you have to erase a whole block at a time. Surprisingly, the industry managed to turn this into a rather interesting storage medium. The story feels like it has to be one of those cases where marketing asked for something insane and the engineers realized it actually was possible.

Here’s the idea: take several dozen of these cruddy little memories and tape ’em together. Driving them like a big RAID 0 array gives you some pretty impressive latency and throughput, even if each memory is individually unimpressive. Then add some software to make the bundle look like a hard disk to the outside world, and to handle the erase and wear problems to boot, and you get a device with performance characteristics that weren’t possible before.

Flash is organized into pages, blocks, and channels, managed by a flash translation layer (FTL). (Further subdivisions exist, but they’re unimportant for our discussion.) The FTL maps a logical address space onto physical pages scattered around the device, attempting to stripe data for write performance. Partially overwriting a page triggers a read-modify-write cycle.
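A toy model of that translation layer might look like the following. A real FTL is vastly more sophisticated (and secret, as we’ll get to later); this is just to illustrate the out-of-place write pattern:

```python
class ToyFTL:
    """Maps logical page numbers to physical pages, writing out of place."""
    def __init__(self, num_physical_pages):
        self.mapping = {}                          # logical page -> physical page
        self.free = list(range(num_physical_pages))
        self.stale = set()                         # pages waiting to be erased

    def write(self, logical_page, data, flash):
        # Flash can't overwrite in place, so every write lands on a fresh
        # physical page; the old copy is marked stale for later garbage
        # collection (which is where erase cycles and wear come in).
        phys = self.free.pop()
        flash[phys] = data
        old = self.mapping.get(logical_page)
        if old is not None:
            self.stale.add(old)
        self.mapping[logical_page] = phys

    def read(self, logical_page, flash):
        return flash[self.mapping[logical_page]]
```

A partial overwrite of a logical page is the awkward case: the FTL has to read the old physical page, patch it in memory, and write the whole thing to a fresh page, which is the read-modify-write cycle mentioned above.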

Performance-wise, flash is really good at lots of little reads all happening in parallel, because all those different little flash memories can be reading different things at the same time. It’s decent at parallel writes, as long as you’re writing full pages at a time and overwriting pages as little as possible. The sequential vs. random distinction matters much less now, because there’s essentially an arbitrary mapping between disk addresses and flash pages; in fact, the FTL is deliberately spreading pages around to get better parallelism anyway.

Do Log-Structured Merge Trees Make Sense on Flash?

So, in light of these very different disks, do LSM Trees still make sense? This is a subjective question, but I already told you I think the answer is yes.

Let’s review the tradeoffs. You get the ability to write out data in large runs, and you get a few knobs to trade off the write amplification of updates against the amount of space overhead you’re willing to spend to push your WAF even lower. The downsides are read amplification and space overhead.

The space overhead is a concern, but it’s a knob we can tweak, which is at least ok.

Read amplification is actually less of a concern now. SSDs are really good at high queue depth random reads! So if we need to page in more stuff because we have more overlays, we expect the SSD to have very little trouble keeping up. In general, there isn’t much of a disadvantage from the disk side of having more tree levels / overlays. If we’re missing parts of the tree on read, the SSD can get them all about as fast as it could get just one anyways.

But here’s the interesting thing: the last problem was latency tails, and I don’t think those are a problem either.

LSM Policies for Good Latency Tails on Flash

The problem with tail latency stemmed from our need to write one big run to a hard disk without moving the needle. On an SSD, we don’t need to do everything in one big step. As long as we write in page-aligned chunks, it doesn’t matter whether we issue those chunks back-to-back or spread them out over time. And spreading them out ‘fixes’ our tail latency problem.

Here’s an example scheme. Choose a ‘deadline’ for a background task, like checkpointing the log or merging two trees, by forecasting when the next instance of that task will have to start. This can be guesstimated, but if your system has throttling limits, you can forecast the worst case (soonest deadline) precisely: (the amount of data clients must write to trigger the next instance of the task) divided by (the throttling limit on how fast clients can write that data). So if the task runs once every 50,000 client writes, and clients are limited to 10,000 IOPS, then we have 5 seconds to finish the task.

Once a deadline is known, you split the work into small chunks, executed serially, where each chunk involves roughly the same order of magnitude of CPU processing and disk I/O as a typical client request. Then you space these chunks out so they run at the minimum rate needed to finish by the deadline. For example, if you need to do 100 chunks of work and finish in 5 seconds, then you need to do a chunk every (5 seconds / 100 chunks), or roughly one every 50 milliseconds. That becomes your throttling rate for the background task.
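In code, the pacing calculation is trivial. The numbers below are just the example figures from above:

```python
def background_pacing(writes_per_trigger=50_000, client_iops_limit=10_000,
                      total_chunks=100):
    # Worst-case deadline: how soon clients can trigger the next instance
    # of this task if they write as fast as the throttle allows.
    deadline_s = writes_per_trigger / client_iops_limit    # 5.0 seconds
    # Spread the chunks evenly so we finish right at the deadline.
    interval_s = deadline_s / total_chunks                 # 0.05 s = 50 ms
    return deadline_s, interval_s

deadline, interval = background_pacing()
print(f"deadline: {deadline:.1f} s, one chunk every {interval * 1000:.0f} ms")
```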

This scheme minimizes the rate at which background work runs, which in turn minimizes the CPU and disk resources being ‘taken away’ from foreground work at any given moment. The deadline forecast is what lets us pick a minimal rate without ever falling behind on work. Furthermore, because each chunk of work is on the same order of magnitude as a single I/O, a background I/O never steals more resources than a parallel foreground I/O would. From a client’s perspective, it just looks like there’s a little extra load on the server.

In other words, we get consistent background work overhead, which pretty much eliminates our latency tails. I haven’t tried this on a real database or benchmarked anything, but I’m reasonably confident this is worth a shot. In fact, I’d be surprised if nobody else has tried this yet.

LSMs vs FTLs

So is running an LSM Tree on an SSD worthwhile? Maybe. From the analysis above, I think the idea is at least interesting. On the other hand, it’s worth noting that the design we’ve arrived at comes within spitting distance of what FTLs do in the first place.

Consider what FTL designs already do: they spread writes out spatially and temporally, and they overprovision a little extra space to manage the indexing structure that maps logical addresses back to physical pages. This should all sound very similar to the work an LSM Tree would be doing on flash. Indeed, using an LSM Tree this way is tantamount to bypassing the FTL.

The merge step of an LSM is a big read-modify-write, but we’re doing it in small chunks. The in-page update read-modify-writes of an SSD’s FTL accomplish the same thing. The question is: who can do it better? That’s a very hard question to answer in general, because FTLs are mystery meat. Flash vendors all try to beat each other with interesting, top-secret optimizations, and they don’t share those with anyone, certainly not you or me. Heck, they don’t even tell us what the page size is, and that’s extremely relevant to anyone trying to align their write sizes to FTL-friendly boundaries! If an FTL is a black box, it’s hard to say whether moving its job up to the software layer is worthwhile.

However, one clear downside of FTLs is that they’re not tunable. They make tradeoffs between latency, write amplification, wear, and space overhead entirely internally, on your behalf, without giving you any insight into what tradeoffs were made or what knobs you could tune. If you’re a huge customer, you can certainly reach out to the vendor and ask for these things, but for the rest of us, moving the FTL’s job up into software, using an LSM Tree or something close to one, gives us the ability to control the tradeoff ourselves, which may be a win regardless of whether FTLs have tricks to do it better than us.