Wednesday, October 16, 2013

The SPARROW Theorem for performance of storage systems

Introduction
I have been measuring the performance of a number of storage systems, and this post describes what I have learned so far. I do not claim that any of this is new; it is written from my perspective and is a very opinionated piece.

Database and storage systems have evolved over the years, from storing data on spinning disks to solid-state storage (SSD) and random-access-memory (RAM) storage. These storage devices have very different performance characteristics, and their different physical constraints mean that we might need different methodologies to measure their performance. In the following discussion, I refer to OLTP workloads where each record is small (smaller than 4K) and records are read randomly.

Spinning Disks
Spinning disks can typically sustain 100 to 200 IOs per second. Random accesses are typically bottlenecked by the number of IOs the disk can service, while sequential access speeds can reach up to 80 MBytes/sec. Most database systems that run on spinning disks are optimized to reduce the number of random reads to storage needed to service a specified workload. Btree implementations (like InnoDB, Berkeley DB, etc.) try to reduce the number of random reads by caching some part of the working set in RAM. LSM databases (like Apache HBase) convert random writes to sequential writes to get better database performance. For all these systems, reducing the number of random disk accesses directly improves performance, especially when the workload is mostly small random reads or short scans.
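
To make this concrete, here is a back-of-the-envelope sketch in Python; the disk count and the per-query read counts are made-up illustrative numbers, not measurements:

    # Toy model: on spinning disks, query throughput is bounded by the
    # random IOs per second that the disks can service.
    DISK_IOPS = 150        # a spinning disk sustains roughly 100-200 IOs/sec
    NUM_DISKS = 10         # hypothetical server with 10 disks

    def max_queries_per_sec(random_reads_per_query):
        """Upper bound on throughput when each query costs some number of
        random disk reads (index misses, data page reads, and so on)."""
        return (DISK_IOPS * NUM_DISKS) / random_reads_per_query

    # Halving the random reads per query doubles the achievable throughput.
    print(max_queries_per_sec(4))   # 375.0 queries/sec
    print(max_queries_per_sec(2))   # 750.0 queries/sec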

Solid State Devices (SSD)
SSDs have physical constraints that are different from spinning disks. An SSD can typically perform about 60K to 100K IOs per second, orders of magnitude more than what a spinning disk can possibly do. The throughput of an SSD can vary from 300 MBytes/sec to 600 MBytes/sec depending on the manufacturer and the percentage of free space on the SSD.

If the storage software issues 100K IOs per second and each IO is 4K, then the total data bandwidth needed is 400 MBytes/sec, which is close to the maximum throughput of the SSD. If you run an OLTP benchmark on SSDs, it is likely that you are bottlenecked for one of two reasons:
    (1) you have used up all the random IOs per second offered by the device, or
    (2) you have maxed out the SSD data bandwidth.
If you are reaching the maximum data throughput of the device, then reducing the number of random accesses to storage might not improve performance for your system; you need to reduce the total storage bandwidth it consumes too.
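
Here is a small Python sketch of that arithmetic; the device limits are the illustrative numbers used above, and the helper function is hypothetical:

    # Which device limit does an IO pattern hit first? The limits are the
    # illustrative numbers from the text, not those of any specific SSD.
    IOPS_LIMIT = 100_000             # random IOs/sec the device can service
    BANDWIDTH_LIMIT = 400 * 10**6    # ~400 MBytes/sec of data bandwidth

    def ssd_bottleneck(ios_per_sec, bytes_per_io):
        if ios_per_sec > IOPS_LIMIT:
            return "bottlenecked on IOs per second"
        if ios_per_sec * bytes_per_io > BANDWIDTH_LIMIT:
            return "bottlenecked on data bandwidth"
        return "within device limits"

    # 100K IOs of 4K each need ~410 MBytes/sec: bandwidth-bound.
    print(ssd_bottleneck(100_000, 4096))
    # Many tiny IOs exhaust the IOPS budget long before the bandwidth.
    print(ssd_bottleneck(120_000, 512))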

So, what consumes the storage bandwidth of the device? This bandwidth is consumed by reads and writes done by the storage software on behalf of the application.

Read Amplification (RA)
Read Amplification is the number of storage bytes that must be accessed to satisfy a single read request of 1 byte. For example, if a btree-based storage system has to read a single 4K page from storage (assuming that the index pages are cached in RAM) for every read request of 1 byte, then it has a read amplification of 4K. On the other hand, if an LSM based database needs to consult three storage files to satisfy a single read request of size 1 byte and the block size of each file is 4K, then its read amplification is 12K (ignoring the existence of bloom filters). My colleague Mark Callaghan has an awesome post about Read Amplification.
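
The arithmetic from these two examples, as a minimal Python sketch:

    # Read amplification: storage bytes read per 1-byte read request,
    # using the 4K page / block sizes from the examples above.
    def read_amplification(blocks_read, block_size):
        return blocks_read * block_size

    # Btree with index pages cached in RAM: one 4K leaf page per read.
    print(read_amplification(1, 4096))   # 4096
    # LSM consulting three files (ignoring bloom filters): three 4K blocks.
    print(read_amplification(3, 4096))   # 12288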

Write Amplification (WA)
Write Amplification is the number of storage bytes that must be accessed to satisfy a single 1-byte write request to the database. WA includes the write amplification caused by the database software as well as the write amplification generated by the SSD itself.

For example, if a btree-based storage system has to write 1 byte into the database, it has to read the 4K page into memory, update it, and then write it back, thereby incurring a write amplification of 8K. A btree-based storage system typically overwrites the SSD page in place, thereby incurring additional write amplification in the SSD layer too.

On the other hand, an LSM storage engine typically writes the new byte to a new place on the SSD and so has a write amplification of 1 (not including compaction).

For transaction logging, a btree-based system needs both a redo log and an undo log, which means that WA is further increased by 2 for these systems. An LSM-based system needs only a redo log, which increases WA by 1.
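
Putting the two write paths side by side, a minimal sketch (SSD-internal write amplification and compaction are ignored, as above):

    PAGE = 4096   # bytes per btree page, as in the example above

    # Btree: read the 4K page in and write it back (8K of storage traffic),
    # plus one byte each to the redo log and the undo log.
    btree_wa = (PAGE + PAGE) + 2    # 8194 storage bytes per 1-byte write

    # LSM: the new byte lands in a new place on the SSD (compaction not
    # counted), plus one byte to the redo log.
    lsm_wa = 1 + 1                  # 2 storage bytes per 1-byte write

    print(btree_wa, lsm_wa)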

But the above does not mean that LSM systems are better. Before I try to explain why, please allow me to write about Space Amplification.

Space Amplification (SA)
Space Amplification is the number of storage bytes needed to store 1 byte of information. Storing multiple indices of the same data increases the storage requirements but can decrease read latencies. Space Amplification is also caused by internal fragmentation and padding. An LSM database can have the same key with older versions of the data in multiple files, and this too causes Space Amplification.
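
A minimal sketch of the same accounting, with made-up sizes for illustration:

    # Space amplification: storage bytes used per byte of live information.
    def space_amplification(stored_bytes, live_bytes):
        return stored_bytes / live_bytes

    # Hypothetical LSM holding two stale versions of every key across its
    # files: 100 MB of live data occupying 300 MB before compaction.
    print(space_amplification(300 * 2**20, 100 * 2**20))   # 3.0
    # Btree pages that are on average two-thirds full (fragmentation).
    print(space_amplification(3 * 2**20, 2 * 2**20))       # 1.5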

The SPAce, Read Or Write theorem (SPARROW)

The SPARROW Theorem states that:
1. RA is inversely related to WA
2. WA is inversely related to SA

If you want to reduce the Read Amplification caused by a specific workload, you can achieve it only by incurring higher Write Amplification. Conversely, if you want to reduce the Write Amplification caused by a specific workload, you have to incur higher Read Amplification to achieve that goal. (This assumes that Space Amplification is held constant at all times.)

Similarly, the only way to reduce the Space Amplification caused by a specific workload is to tune the system so that it can sustain a higher Write Amplification. Conversely, if you want to reduce Write Amplification, then you have to incur higher Space Amplification for the same workload. (Here it is Read Amplification that is held constant.)
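
To see the theorem at work in one concrete (and heavily simplified) setting, here is a toy cost model of a leveled LSM tree. The formulas are common rules of thumb that I am assuming for illustration, not something derived in this post:

    # Toy leveled-LSM cost model (rule-of-thumb formulas, not exact):
    #   RA ~ levels             (a point read may have to check every level)
    #   WA ~ growth * levels    (a byte is rewritten ~growth times per level)
    #   SA ~ 1 + 1/growth       (stale data waiting to be compacted away)
    def lsm_costs(levels, growth):
        return {"RA": levels, "WA": growth * levels, "SA": 1 + 1 / growth}

    # Both configurations hold the same amount of data (4**6 == 16**3),
    # yet lowering RA raises WA, exactly as SPARROW predicts.
    print(lsm_costs(levels=6, growth=4))    # RA 6, WA 24, SA 1.25
    print(lsm_costs(levels=3, growth=16))   # RA 3, WA 48, SA 1.0625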

Implications of the SPARROW Theorem
There isn't any storage system that can escape the clutches of the SPARROW Theorem. A single system architecture CANNOT reduce all three of SA, RA and WA at the same time. I am talking about the architecture, not about implementations.

A practical database implementation will always try to reduce its current SA, WA and RA by optimizing its code and algorithms. But once all its code and algorithms are optimized, it won't be able to improve along all three dimensions at the same time. Its performance will be confined by the walls outlined by the SPARROW Theorem.

Given the above fact, it would be great to have most database systems be configurable, so that an administrator can tune each of these three dimensions based on the workload and the hardware they are running on. This would result in a highly flexible database system architecture that can sustain a myriad of workloads.