Skunk Database System

By Gianni Tedesco. Last updated: 2018-04-09

skunk is a database system intended for workloads where query speed is paramount and data is written to infrequently in an append-only fashion. The system is currently experimental, and not available, but here are some ruminations on the design considerations and some of the salient technical details.

Table format (rec)

There are many options for how to store records. If records have primary keys then, often, records are embedded in to a b-tree. Other options are to use a paged-format and then allocate space heap-style from there. Fancy items such a schema version, or a null bitmap might also be included.

For our considerations, we don't care too much about primary-keyed tables, or can implement them with a different format later. Also we don't support nulled values or updates at all.

For this reason we chose to round up record size to a power of two. And then use a power-of-two of records inside each buffer, aiming for a page size of 1MiB. The trailing page is allowed to be shorter. This allows us to locate all records with shifts and masks and means we write to the storage device in 1MiB chunks which should be good for both spinning rust and SSD alike.

Not much to say about this, it's the simplest imaginable approach. As we look to support star-schemas and other primary-keyed tables later we will look in to a different table format for that.

Originally we did support updates and deletes. Deletes were based on bitmaps which were placed at the trailer of each page. But doing this caused locating a record to include an idiv (integer division) instruction that couldn't be optimised in to a multiply at compile-time. This led to a 10x slowdown in locating records. In retrospect a better approach would have been a seperate roaring bitmap - which allows random access - that would cover the entire table. Anyway, for my intended use-case, deletes are not important and updates would be a staggering rarity.

Insert Performance

A real-world dataset is used. It's 1 months of netflow records from a busy network.

Import 151,137,635 64-byte records (from binary) in to a table. Final table size is 9226 MiB. Note that space-overhead here is basically zero, this is partly luck that our record size is a power-of-two already. Overhead could be close to double in the worst-case.

At the same time as importing we Index src/dst ip's in to a single index, and protocol-id in to another index. This results in 609 MiB (656,047 keys) and 77 MiB (39 keys) indices respectively. Note that the biggest index is only a 6.5% overhead which is much better than with EWAH compressed bitmaps which almost double the size. A postings-list or b-tree would be more inefficient still.

Overall, the process takes 28.7s, or around 5.27M rows per second insert rate with indexing. Without indexing, we can do this in 7.31s, which is about 20M rows per second. There's still some room for optimisations here but this is a nice start. Indexing performance seems pretty bad, but it makes sense because we're not using write-optimised data structures since we care more about query performance. Still, it might be a nice thing to add in future as a trade-off that the user can make - much faster inserts in return for slightly slower queries.

Index Format (br64)

A big part of our intended use-case is using indices to speed up searches. Eventually there will be specialised indices for various data types such as text or multi-dimensional data. But for now we focus on integers of up to 64 bits in width.

Since our indices are constructed in a single-pass during table construction we have no reason to avoid b-trees whose worst-case performance is only encountered when performing random inserts. If a b-tree is "bulk-loaded" then we can obtain optimal packing and layout.

We chose a fanout of 512. This works out to a page size of 4KiB which should be good for mmapped access on both SSD and spinning rust - a random read anywhere within the memory mapped index file should bring in exactly one node. Keys and values are both 64bit integers. We lay out nodes in breadth-first-search (BFS) order.

For internal pages, the values are page-numbers pointing to a child page and keys are splitting keys. The unused splitting key at the start of the page is used to record the number of values in that page.

For leaf nodes, the values are implementation-defined. The only implementation right now is to point to a roaring-bitmap of records in the table which match the associated key. Since there is no space in the pages to record node-type or number of nodes, we store this information in the file-header. Specifically, the number of keys, and the page number of the first leaf node is stored. From this we can determine if a page is a leaf node by it's page number (pgno ≥ first leaf) and how many keys are stored in the possibly not-full final leaf (nr_keys modulo fanout). Note that these header values also permit very efficient index-scans, since our leaves are packed contiguously and can be located instantly. A useful optimisation for queries ordered by the search key.

Bitmaps are stored in something similar to roaring-bitmap style. This seems to provide much better compression ratios than run-length encoded (RLE) bitmaps while maintaining all the good properties (set operations can be performed directly on compressed representation). But also allows for fast random access - unlike RLE.

These bitmaps work by dividing up the value-space and chosing the smallest representation for each part of that space, based on its density. First, the space of record-numbers is divided up in to containers which cover 2¹⁶ contiguous records, starting from the zeroth. For each container, we maintain a count of the number of items which are present. For sparsely populated containers we use a sorted array of 16-bit unsigned integers, for semi-dense containers we use a bitmap, and for dense containers we use a sorted array of 16-bit integers mentioning all unset items (ie. an inverted-list representation). The boundaries between sparse, semi-dense and dense are chosen at the crossover points where each representation becomes the most space-efficient.

Unpacking and intersecting of the bitmaps can be performed pretty efficiently with just a little bit of clever bit twiddling in C. Further, several of the important cases, such as unpacking the bitmaps or intersecting bitmaps, can be optimized with AVX-512. List containers aren't as easily optimized but the silver lining there is that the list container is only used in very sparse (or very dense) regions which are, by definition, the easiest cases.

For 8 and 16 bit keys we can dispense with the b-tree and use a direct table-lookup, avoiding the need to binary search. For larger composite keys we will probably use a multi-level version of these schemes where each component of a composite key is indexed in a separate b-tree. This should assist with left-anchored queries and index-scans.

Query Planner

Right now the query planner is unremarkable and uses heuristics to pick indices to use and a join order that looks useful for hash joins. In future we will be implementing a cost-based query planner.

The write-once query-many architecture makes materialisation cheap and efficient for us, so we may experiment with tweaking the query planner to make more extensive use of materialisation, which is usually avoided in more conventional RDBMS's. One anticpated problem with this is that without limiting query plan searches only to left-deep trees then we increase our search space which will increase the cost of planning. But for prepared-statements or stored-procedures where the plan is generated only once, it might pay off, especially if we introduce a parametrised planner.

Another benefit of this architecture is that we can produce catalogs with accurate selectivity information which will be useful when picking indices and when planning equi-joins.

Query Evaluator

Currently, the query evaluator follows the volcano model. It's a tree of nodes which implement a generator function, iterating over the output of a previous node. In reality the tree is more of a linked list since the planner is using left-deep trees.

Evaluation uses the "pull" method. Iteration over the results is driven from the outside-in. That is to say that the front-end interface contains, rewind, get-next-row and is_eof? methods. This avoids the need for consumers to provide a callback function to receive the results of the queries.

However, this is a pretty inefficient design since everything works one row at a time. This means we need to spill our most performance criticial loop state in to memory rather than giving it a chance to stay in registers. Consequently, the overhead of the generators/iterators model starts to dominate the cost of query evaluation.

The inherent slowness of the volcano model has been worked around so far by making use of batching, especially in tablescan, and indexseek node types. More importantly the system is being moved towards more of a compiler-oriented design where multiple stages may be inlined together in to a well optimized piece of machine-code.

Bytecode Virtual Machine

Where possible, selection expressions are compiled in to a bytecode for efficient evaluation. Right now, only basic expressions are supported. That means column-to-column comparisons and column-to-literal comparisons joined by conjunctions (AND) or disjunctions (OR). For more complex queries involving arithmetic or variable assignments we fall back to other, more functional, methods.

The bytecode is designed to reduce the number of instructions as much as posssible. In my previous experiences with bytecode vm's, RISC-like approaches based on separate loading, storing and control-flow instructions increase overhead by making it necessary to perform more instruction dispatches. They also require more state, such as registers and condition flags.

In our system, the machine state is simply the instruction pointer, a halted flag, and a matched flag. The program is a linear array of instructions, contiguously following the machine state to reduce pointer chasing and boil the main-loop down to a bunch of fast lea and call instructions.

There are four instruction types:

Halt without match
Halt with match
Compare a column to a column
Compare a column to a literal

Each instruction contains a pointer to a function which implements the instruction. A different function implements each of the instruction types. The operand space in the instruction contains the information necessary to retrieve the field (an index to find the relation, and an offset and length to locate the column within the relation). And that is either doubled up for field-to-field compares, or a literal is added in place of the second field's information. Finally two addressess are included in each instruction, the first is followed if the comparison evaluates false, and the second is followed if the comparison evaluates true.

The instruction at address zero halts without a match. The instruction at address one halts with a match. The eliminates unnecessary condition checking and branching in the inner loop. The entry point is always at address two.

The compiler, itself is only 317 lines of code, half of which is checking an abstract-syntax tree representing the selection in order to determine if it representable in the bytecode. The main dispatch loop of the interpreter is 7 lines of code.

On the dataset mentioned below, a series of un-indexed equality conditions over 4 columns using full table scan over ~151M records was reduced from 9.7s to 2.9s of run-time. That translates to an improvement in evaluation rate from 15M rows per second to 50M rows per second.

Contrary to popular belief, the cause of the improved performance is not reduced branch-misses (which actually increased ten-fold) or cache misses (which stayed roughly equal). But simply a drastic reduction in instruction path-length (~99Bn down to ~27Bn), with a slight reduction in instructions per-clock. About 20% of the removed instructions were branches, however.

The main benefit of the machine is that it's pretty efficient - close to the performance of native code in fact. But the downside is that only simple operations are supported. Support for a much wider variety of selection expression could probably be achieved by adding new instructions and an accumulator - maybe even a stack. But we would want to be careful not to slow down the usual, simple, kinds of selections - the common case we've optimized for - by adding too much complexity.

JIT Compliation of Query Evaluators

In order to support more complex queries than the bytecode interpreter, libgccjit is used to compile selection expressions in to native machine code. This approach produces faster code at the expense of taking a bit more time and resources during the preparation of the query. Compiling and optimizing is hard work, after all. Certainly moreso than the simple methods used for the bytecode interpreter.

A well known limitation of JIT compilation in query evaluation is that most of the benefits to be had come from combining and inlining multiple stages of the query, which can often be difficult.

The skunk architecture allows us to combine full table scans and selections in to a single JITed program. This is because it's easy to iterate over all records with a simple for-loop without having to bring huge amounts of code to walk through a complicated b-tree or record struture.

Using this technique, the query used to test the bytecode evaluator above can be executed in around 1.5s, which gives us an evaluation rate of nearly 105M rows per second.

One shortcoming right now is that no special-case code is generated to catch exception conditions like undefined shifts or divides by zero. Typically the most difficult part about JITing evaluators is that as more internal functions are inlined, more and more errors and exception cases need to be handled within the JITd code which can greatly increase the complexity of this solution. After all, if writing some code itself is difficult, then writing code which generates an AST equivalent to that code is always going to be even more difficult.

Atomicity and Durability

Our append-only design makes atomicity and durability relatively easy to achieve. There is code for WAL operations that we might resurrect in future. Given our use cases, it's more likely that we will have a data-injestion API that provides special control over checkpointing and recovery.

Final Thoughts and Future Directions

Overall, for a few weeks of part-time work, I'm quite happy with the results. Certainly there are some interesting future directions still, especially in query planning and evaluation.

The most immediate task is rounding out the query language and implementing joins, aggregates, and other expected features of an RDBMS. After that the system will need a real cost-based query planner to make the best use of the query evaluator.

There are plans for a few exotic features to specifically optimize time-series tables. Those are tables where timestamped rows are appended, either periodically, or from a real-time event-stream. Also I'd be quite interested in experimenting with column-store record organisation if I can ever find the free time.