skunk is a database system intended for workloads where query speed is paramount and data is written to infrequently in an append-only fashion. The system is currently experimental, and not available, but here are some of the salient technical details.
There are many options for how to store records. If records have primary keys then, often, records are embedded in to a b-tree. Other options are to use a paged-format and then allocate space heap-style from there. Fancy items such a schema version, or a null bitmap might also be included.
For our considerations, we don't care too mich about primary-keyed tables, or can implement them with a different format later. Also we don't support nulled values or updates at all.
For this reason we chose to round up record size to a power of two. And then use a power-of-two of records inside each buffer. Aiming for a page size of 1MiB. The trailing page is allowed to be shorter. This allows us to locate all records with shifts and masks and means we write to the storage device in 1MiB chunks which should be good for both spinning rust and SSD alike.
Not much to say about this, it's the simplest imaginable approach. As we look to support star-schemas and other primary-keyed tables later we will look in to a different table format for that.
Originally we did support updates and deletes. Deletes were based on bitmaps which were placed at the trailer of each page. But doing this caused locating a record to include an idiv (integer division) instruction that couldn't be optimised in to a multiply at compile-time. This led to a 10x slowdown in locating records. In retrospect a better approach would have been a seperate roaring bitmap - which allows random access - that would cover the entire table. Anyway, for my intended use-case, deletes are not important and updates would be a staggering rarity.
A big part of our intended use-case is using indices to speed up searches. Eventually there will be specialised indices for various data types such as text or multi-dimensional data. But for now we focus on integers of up to 64 bits in width.
Since our indices are constructed in a single-pass during table construction we have no reason to avoid b-trees whose worst-case performance is only encountered when performing random inserts. If a b-tree is "bulk-loaded" then we can obtain optimal packing and layout.
We chose a fanout of 512. This works out to a page size of 4KiB which should be good for mmapped access on both SSD and spinning rust - a random read anywhere within the memory mapped index file should bring in exactly one node. Keys and values are both 64bit integers. We lay out nodes in breadth-first-search (BFS) order.
For internal pages, the values are page-numbers pointing to a child page and keys are splitting keys. The unused splitting key at the start of the page is used to record the number of values in that page.
For leaf nodes, the values are implementation-defined. The only implementation right now is to point to a roaring-bitmap of records in the table which match the associated key. Since there is no space in the pages to record node-type or number of nodes, we store this information in the file-header. Specifically, the number of keys, and the page number of the first leaf node is stored. From this we can determine if a page is a leaf node by it's page number (pgno ≥ first leaf) and how many keys are stored in the possibly not-full final leaf (nr_keys modulo fanout). Note that these header values also permit very efficient index-scans, since our leaves are packed contiguously and can be located instantly. A useful optimisation for queries ordered by the search key.
Bitmaps are stored in something similar to roaring-bitmap style. This seems to provide much better compression ratios than run-length encoded (RLE) bitmaps while maintaining all the good properties (set operations can be performed directly on compressed representation). But also allows for fast random access - unlike RLE.
These bitmaps work by dividing up the value-space and chosing the smallest representation for each part of that space, based on its density. First, the space of record-numbers is divided up in to containers which cover 216 contiguous records, starting from the zeroth. For each container, we maintain a count of the number of items which are present. For sparsely populated containers we use a sorted array of 16-bit unsigned integers, for semi-dense containers we use a bitmap, and for dense containers we use a sorted array of 16-bit integers mentioning all unset items (ie. an inverted-list representation). The boundaries between sparse, semi-dense and dense are chosen at the crossover points where each representation becomes the most space-efficient.
For 8 and 16 bit keys we can dispense with the b-tree and use a direct table-lookup, avoiding the need to binary search. For larger composite keys we will probably use a multi-level version of these schemes where each component of a composite key is indexed in a separate b-tree. This should assist with left-anchored queries and index-scans.
Right now the query planner is unremarkable and uses heuristics to pick indices to use and a join order that looks useful for hash joins. In future we will be implementing a cost-based query planner.
Another optimisation will be to JIT compile selection bytecodes in to x86-64, at least for stored procedures.
The write-once query-many architecture makes materialisation cheap and efficient for us, so we may experiment with tweaking the query planner to make more extensive use of materialisation, which is usually avoided in more conventional RDBMS's. One anticpated problem with this is that without limiting query plan searches only to left-deep trees then we increase our search space which will increase the cost of planning. But for stored-procedures where the plan is generated only once, it might pay off, especially if we introduce a parametrised planner.
Another benefit of this architecture is that we can produce catalogs with accurate selectivity information which will be useful when picking indices and when planning equi-joins.
Currently, the query evaluator is a tree of nodes which perform each task. Joins are always left-deep so we don't do any materialisation.
Iteration over the results is driven from the outside-in. That is to say that the front-end interface contains, rewind, get-next-row and is_eof? methods. This avoids the need for callback functions in the query interface.
However, this is a pretty inefficient design since everything works one row at a time. This means we need to spill our most performance criticial inner-loop state in to memory rather than giving it a chance to stay in registers. Most painfully it's slow at doing index bitmap unions/intersects which benefit greatly from batching. It also doesnt allow us to easily find resultset sizes. Overall I think this design, while simple, and nice to work with from the front-end in some respects, is a failure.
We will develop a new system which makes more use of batching, and which, at least, materialises the final resultset, paginated if necessary. This should also be a more natural design when adding support for aggregation queries.
Where possible, selection expressions are compiled in to a bytecode for efficient evaluation. Right now, only basic expressions are supported. That means column-to-column comparisons and column-to-literal comparisons joined by conjunctions (AND) or disjunctions (OR). For more complex queries involving arithmetic or variable assignments we fall back to the, much slower, parse-tree based evaluator.
The bytecode is designed to reduce the number of instructions as much as posssible. In my previous experiences with bytecode vm's, RISC-like approaches based on separate loading, storing and control-flow instructions increase overhead by making it necessary to perform more instruction dispatches. They also require more state, such as registers and condition flags.
In our system, the machine state is simply the instruction pointer, a halted flag, and a matched flag. The program is a linear array of instructions, contiguously following the machine state to reduce pointer chasing and boil the main-loop down to a bunch of fast lea and call instructions.
There are four instruction types:
Each instruction contains a pointer to a function which implements the instruction. A different function implements each of the instruction types. The operand space in the instruction contains the information necessary to retrieve the field (an index to find the relation, and an offset and length to locate the column within the relation). And that is either doubled up for field-to-field compares, or a literal is added in place of the second field's information. Finally two addressess are included in each instruction, the first is followed if the comparison evaluates false, and the second is followed if the expression evaluates true.
The instruction at address zero halts without a match. The instruction at address one halts with a match. The eliminates unnecessary condition checking and branching in the inner loop. The entry point is always address 2.
The compiler, itself is only 317 lines of code, half of which is checking an abstract-syntax tree representing the selection in order to determine if it representable in the bytecode. The main dispatch loop of the interpreter is 7 lines of code.
On the dataset mentioned below, a series of un-indexed equality conditions over 4 columns using full table scan over ~151M records was reduced from 9.7s to 2.9s of run-time. That translates to an improvement in evaluation rate from 15M rows per second to 50M rows per second.
Contrary to popular belief, the cause of the improved performance is not reduced branch-misses (which actually increased ten-fold) or cache misses (which stayed roughly equal). But simply a drastic reduction in instruction path-length (~99Bn down to ~27Bn), with a slight reduction in instructions per-clock. About 20% of the removed instructions were branches, however.
We may experiment with a threaded main-loop in future, but it will probably be more useful to resurrect the x86-64 JIT compiler for this. Also it would be nice to have the code work in the case of more complex expressions. This could probably be achieved by adding new instructions and an accumulator. But we would want to be careful not to slow down the usual, simple, kinds of selections. Two separate machines might be required to provide good performance in all cases.
A real-world dataset is used. It's 1 months of netflow records from a busy network.
Import 151,137,635 64-byte records (from binary) in to a table. Final table size is 9226 MiB. Note that space-overhead here is basically zero, this is partly luck that our record size is a power-of-two already. Overhead could be close to double in the worst-case.
At the same time as importing we Index src/dst ip's in to a single index, and protocol-id in to another index. This results in 609 MiB (656,047 keys) and 77 MiB (39 keys) indices respectively. Note that the biggest index is only a 6.5% overhead which is much better than with EWAH compressed bitmaps which almost double the size. A block-list or b-tree would be more inefficient still.
Overall, the process takes 28.7s, or around 5.27M rows per second insert rate with indexing. Without indexing, we can do this in 7.31s, which is about 20M rows per second. There's still some room for optimisations here but this is a nice start. Indexing performance seems pretty bad, but it makes sense because we're not using write-optimised data structures since we care more about query performance. Still, it might be a nice thing to add in future as a trade-off that the user can make - much faster inserts in return for slightly slower queries.
Querying out 94 records using the protocol-id index and format them in ascii takes 0.0019. Or 1.9 miliseconds. The new query evaluator might improve on this but probably most of the overhead is in printf'ing the results and init/cleanup. This takes about 3.8s without the index.
Of course, this number is not hugely interesting right now. But until support for more complex queries is added, then that's all we can really look at.
Overall, for a few weeks of part-time work, I'm quite happy with the results. Even if they are not mind-blowing. Certainly there are some interesting future directions still, especially in query planning and evaluation.
Our design makes durability quite easy to achieve. There is code for WAL operations that we might resurrect in future. Given our use cases, it's more likely that we will have a data-injestion API that provides special control over checkpointing and recovery.
Copyright (c) Spanish Inquisition 1478-1834. All rights reversed.