Storage Engines for Databases
(15 minutes read)
We will talk about LSM Trees (instead of conventional B+ Tree based databases), which are used in databases such as Cassandra, Google BigTable etc. For simplicity, we are designing a basic Key-Value Pair DB Storage, without ACID/BASE complications.
DB containers should internally be backed by two storage engines: a transactional storage engine and an analytical storage engine. Both storage engines are log-structured and write-optimized for blazing fast updates. The transactional storage engine is encoded in a row-oriented format for fast transactional reads and queries. The analytical storage engine is encoded in a columnar format for fast analytical queries and scans, and is fully updatable. The transactional storage engine is backed by local SSDs, while the analytical store lives on cost-efficient, off-cluster SSD storage.
Wikipedia Definition of a Storage engine: A database engine (or storage engine) is the underlying software component that a database management system (DBMS) uses to create, read, update and delete (CRUD) data from a database. Most database management systems include their own application programming interface (API) that allows the user to interact with their underlying engine without going through the user interface of the DBMS.
LSM trees are persisted to disk using a Sorted Strings Table (SSTable) format. As indicated by the name, SSTables are a format for storing key-value pairs in which the keys are in sorted order. An SSTable will consist of multiple sorted files called segments (Elasticsearch fans might relate 😊 ). These segments are immutable once they are written to disk.
One problem for you guys: how do we merge these segments efficiently? (Real chocolate money via UPI for whoever solves this first.)
Flow for writing data
Since writes will not come in sorted key order, an in-memory Red-Black Tree (or simply any balanced BST) is used. As writes come in, the data is written to this in-memory store. Once the red-black tree has enough entries, it is flushed to disk as a segment, in sorted order. This allows us to write the segment file as a single sequential write even though the inserts may occur in any order.
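A tiny sketch of this write buffer (often called a memtable). Python's stdlib has no red-black tree, so this version buffers writes in a dict and sorts once at flush time, which produces the same sorted segment; the `flush_threshold` parameter is purely illustrative:

```python
class MemTable:
    """In-memory write buffer, flushed to disk as a sorted segment."""

    def __init__(self, flush_threshold=4):
        self.entries = {}
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        """Buffer a write; return True once the buffer is big enough to flush."""
        self.entries[key] = value
        return len(self.entries) >= self.flush_threshold

    def flush(self):
        """Emit one segment, sorted by key, ready for a single sequential write."""
        segment = sorted(self.entries.items())
        self.entries = {}
        return segment
```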
Flow for reading data
If we are reading from the disk, each segment could be binary searched in reverse timestamp order (assuming each key-value pair holds one line), but we can achieve better performance, since all these operations currently hit the disk. Time complexity: O(number of segments × log |keys per segment|).
(Extra information: a 1 MB sequential read takes ~1 ms from SSD vs ~0.25 ms from memory. Memory is also far more expensive than SSD, and parallel thread reads are way more efficient in memory.)
We can create an in-memory sparse-index. We can use this index to quickly find the offsets for values that would come before and after the key we want. Now we only have to scan a small portion of each segment file based on those bounds.
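A minimal sketch of such a sparse index, assuming an in-memory segment of sorted (key, value) pairs; the `every` sampling interval is an illustrative knob:

```python
import bisect

def build_sparse_index(segment, every=3):
    """Keep every Nth key and its position in the segment."""
    return [(segment[i][0], i) for i in range(0, len(segment), every)]

def lookup(segment, index, key):
    """Use the sparse index to bound the scan, then scan only that slice."""
    keys = [k for k, _ in index]
    # Rightmost index entry whose key is <= the target key.
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before everything in this segment
    start = index[i][1]
    end = index[i + 1][1] if i + 1 < len(index) else len(segment)
    for k, v in segment[start:end]:
        if k == key:
            return v
    return None
```

On disk, the index positions would be byte offsets into the segment file rather than list indices, but the bounding logic is the same.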
Flow for reducing the number of segment files
Having too many sorted arrays in our storage is a problem: we have to optimize reads as well, and every read has to search them.
We group similar-size arrays and organize them into levels of increasing capacity:

sortedarr1 of size 3
sortedarr2 of size 6
sortedarr3 of size 12

The state will always look like this because we only ever flush arrays of size 3 from the in-memory source, and as soon as we have two arrays of size 3, we merge them.
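The "two equal-size arrays get merged and pushed down" rule behaves like binary addition, with sizes doubling per level. A minimal sketch, assuming plain sorted lists of values and ignoring duplicate-key shadowing to keep it short:

```python
def insert_array(levels, arr):
    """Cascade a freshly flushed sorted array into the levels.

    levels[i] holds at most one sorted array of size base * 2**i
    (None if empty). Two equal-size arrays at a level merge into one
    array of double the size at the next level down, like a carry.
    """
    i = 0
    while True:
        if i == len(levels):
            levels.append(None)
        if levels[i] is None:
            levels[i] = arr
            return levels
        # Merge the two equal-size sorted arrays and carry downward.
        arr = sorted(levels[i] + arr)
        levels[i] = None
        i += 1
```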
A read now looks like this: binary search the level-1 array; if not found, search level 2; if still not found, level 3; and so on.
So each read may issue calls to several sorted arrays.
We will make another layer in main memory (fence pointers) which contains the min and max key of each array.
Now we first do a binary search on this in-memory structure, determine which level(s) could hold the key, and fetch the value from there.
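A sketch of fence pointers over in-memory sorted arrays of keys (names like `build_fences` are illustrative, not from any real library):

```python
def build_fences(levels):
    """One (min_key, max_key) pair per on-disk sorted array."""
    return [(arr[0], arr[-1]) for arr in levels]

def candidate_levels(fences, key):
    """Indices of arrays whose [min, max] range could contain the key.

    Only these arrays need an actual binary search on the storage side;
    all others are skipped without touching disk.
    """
    return [i for i, (lo, hi) in enumerate(fences) if lo <= key <= hi]
```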
Bloom filters are also maintained for every sorted array. They serve the same purpose as fence pointers: avoiding calls to sorted arrays that do not contain the value. The caveat: a bloom filter never returns a false negative, but it occasionally returns a false positive, so a "maybe present" answer still requires checking the array itself.
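A toy bloom filter to make the caveat concrete; the bit size and hash count are illustrative parameters, and SHA-256 stands in for the faster hash functions real implementations use:

```python
import hashlib

class BloomFilter:
    """Set-membership sketch: no false negatives, rare false positives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, key):
        # Derive `hashes` bit positions from the key deterministically.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False => definitely absent; True => possibly present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```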
Merging frequency: more merging leads to fewer sorted arrays in the system, fewer ranges to check at the fence-pointer level, and eventually fewer calls to each sorted array on the storage side, i.e. better reads but costlier writes.
Size-tiered approach: each level has a capacity, i.e. a maximum number of sorted arrays. As soon as the number of sorted arrays in a level reaches this capacity, we merge them all and put the result into the (larger) level below, as in the scheme above.
Leveled approach: as soon as a new sorted array is flushed from the in-memory side into a level, we immediately merge it into the bigger array in the level below it, so the number of sorted arrays at any level stays equal to 1. This continues until a maximum size is reached for that level, and then the same process repeats at the next level.
Flow for Deleting data
Deletes actually follow the exact same path as writing data. Whenever a delete request is received, a unique marker called a tombstone is written for that key. Updates follow the same flow: the new value simply shadows the old one on reads.
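A sketch of how the read path resolves tombstones, assuming the memtable and segments are plain dicts ordered newest first (the `TOMBSTONE` sentinel is illustrative):

```python
TOMBSTONE = object()  # unique in-process marker for a deleted key

def delete(memtable, key):
    """A delete is just a write of the tombstone marker."""
    memtable[key] = TOMBSTONE

def get(memtable, segments, key):
    """Read newest-first; a tombstone hides any older value for the key."""
    for source in [memtable] + segments:  # segments ordered newest first
        if key in source:
            value = source[key]
            return None if value is TOMBSTONE else value
    return None  # key was never written
```

Compaction would eventually drop both the tombstone and the shadowed older entries; until then the marker must survive merges so old values stay hidden.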
Handling Box Crashes
This works very well except for one problem: if the database crashes, the most recent writes (which are in the memory but not yet written out to disk) are lost. In order to avoid that problem, we can keep a separate log on disk to which every write is immediately appended, just like in the previous section. That log is not in sorted order, but that doesn’t matter, because its only purpose is to restore the memory database after a crash. Every time the memory database is written out to an SSTable, the corresponding log can be discarded.
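A minimal write-ahead-log sketch of this idea, using one JSON record per line (the file format and function names are illustrative):

```python
import json
import os

def wal_append(path, key, value):
    """Append the write to the log (and fsync) before applying it in memory."""
    with open(path, "a") as f:
        f.write(json.dumps([key, value]) + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the record really hit the disk

def wal_replay(path):
    """Rebuild the in-memory table after a crash by replaying the log."""
    memtable = {}
    try:
        with open(path) as f:
            for line in f:
                key, value = json.loads(line)
                memtable[key] = value  # later records overwrite earlier ones
    except FileNotFoundError:
        pass  # no log yet means nothing to recover
    return memtable
```

Once the memtable is flushed to an SSTable, the log file can simply be deleted and a fresh one started.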
Full Text Search DBs
Lucene, an indexing engine for full-text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary. A full-text index is much more complex than a key-value index but is based on a similar idea: given a word in a search query, find all the documents (web pages, product descriptions, etc.) that mention the word. This is implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list). In Lucene, this mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed. From this, you can also guess why Lucene-based storage is costlier to scale than its key-value counterparts.