skunk is a database system intended for workloads where query speed is paramount and data is written to infrequently in an append-only fashion. The system is currently experimental, and not available, but here are some of the salient technical details.
There are many options for how to store records. If records have primary keys then, often, records are embedded in to a b-tree. Other options are to use a paged-format and then allocate space heap-style from there. Fancy items such a schema version, or a null bitmap might also be included.
For our considerations, we don't care too mich about primary-keyed tables, or can implement them with a different format later. Also we don't support nulled values or updates at all.
For this reason we chose to round up record size to a power of two. And then use a power-of-two of records inside each buffer. Aiming for a page size of 1MiB. The trailing page is allowed to be shorter. This allows us to locate all records with shifts and masks and means we write to the storage device in 1MiB chunks which should be good for both spinning rust and SSD alike.
Not much to say about this, it's the simplest imaginable approach. As we look to support star-schemas and other primary-keyed tables later we will look in to a different table format for that.
Originally we did support updates and deletes. Deletes were based on bitmaps which were placed at the trailer of each page. But doing this caused locating a record to include an idiv (integer division) instruction that couldn't be optimised in to a multiply at compile-time. This led to a 10x slowdown in locating records. In retrospect a better approach would have been a seperate roaring bitmap - which allows random access - that would cover the entire table. Anyway, for my intended use-case, deletes are not important and updates would be a staggering rarity.
A big part of our intended use-case is using indices to speed up searches. Eventually there will be specialised indices for various data types such as text or multi-dimensional data. But for now we focus on integers of up to 64 bits in width.
Since our indices are constructed in a single-pass during table construction we have no reason to avoid b-trees whose worst-case performance is only encountered when performing random inserts. If a b-tree is "bulk-loaded" then we can obtain optimal packing and layout.
We chose a fanout of 2^16 items. This works out to a page size of 1MiB. Keys and values are both 64bit integers. We lay out nodes in breadth-first-search (BFS) order.
For internal pages, the values are page-numbers pointing to a child page and keys are splitting keys. The unused splitting key at the start of the page is used to record the number of values in that page.
For leaf nodes, the values are implementation-defined. The only implementation right now is to point to a roaring-bitmap of records in the table which match the associated key. Since there is no space in the pages to record node-type or number of nodes, we store this information in the file-header. Specifically, the number of keys, and the page number of the first leaf node is stored. From this we can determine if a page is a leaf node by it's page number (pgno ≥ first leaf) and how many keys are stored in the possibly not-full final leaf (nr_keys modulo fanout). Note that these header values also permit very efficient index-scans, since our leaves are packed contiguously and can be located instantly. A useful optimisation for queries ordered by the search key.
Bitmaps are stored in something similar to roaring-bitmap style. This seems to provide much better compression ratios than run-length encoded (RLE) bitmaps while maintaining all the good properties (set operations can be performed directly on comrpessed representation). But also allows for fast random access - unlike RLE.
These bitmaps work by dividing up the value-space and chosing the smallest representation for each part of that space, based on its density. First, the space of record-numbers is divided up in to containers which cover 216 contiguous records, starting from the zeroth. For each container, we maintain a count of the number of items which are present. For sparsely populated containers we use a sorted array of 16-bit unsigned integers, for semi-dense containers we use a bitmap, and for dense containers we use a sorted array of 16-bit integers mentioning all unset items (ie. an inverted-list representation). The boundaries between sparse, semi-dense and dense are chosen at the crossover points where each representation becomes the most space-efficient.
For 8 and 16 bit keys we can dispense with the b-tree and use a direct table-lookup, avoiding the need to binary search. For larger composite keys we will probably use a multi-level version of these schemes where each component of a composite key is indexed in a separate b-tree. This should assist with left-anchored queries and index-scans.
Right now the query planner is unremarkable and uses heuristics to pick indices to use and a join order that looks useful for hash joins. In future we will be implementing a cost-based query planner.
Another optimisation will be to JIT compile selection bytecodes in to x86-64, at least for stored procedures.
The write-once query-many architecture makes materialisation cheap and efficient for us, so we may experiment with tweaking the query planner to make more extensive use of materialisation, which is usually avoided in more conventional RDBMS's. One anticpated problem with this is that without limiting query plan searches only to left-deep trees then we increase our search space which will increase the cost of planning. But for stored-procedures where the plan is generated only once, it might pay off, especially if we introduce a parametrised planner.
Another benefit of this architecture is that we can produce catalogs with accurate selectivity information which will be useful when picking indices and when planning equi-joins.
Currently, the query evaluator is a tree of nodes which perform each task. Joins are always left-deep so we don't do any materialisation.
This is a pretty inefficient design since everything works one row at a time. Most painfully it's slow at doing index bitmap unions/intersects which benefit greatly from batching. It also doesnt allow us to easily find resultset sizes. Overall I think this design, while simple, and nice to work with from the front-end in some respects, is a failure.
We will develop a new system which makes more use of batching, and which, at least, materialises the final resultset, paginated if necessary. This should also be a more natural design when adding support for aggregation queries.
Another problem is that selections are evaluated as expression-trees rather than as bytecodes, so this is also pretty cache-inefficient and precludes JITing.
A real-world dataset is used. It's 1 months of netflow records from a busy network.
Import 151,137,635 64-byte records (from binary) in to a table. Final table size is 9226 MiB. Note that space-overhead here is basically zero, this is partly luck that our record size is a power-of-two already. Overhead could be close to double in the worst-case.
At the same time as importing we Index src/dst ip's in to a single index, and protocol-id in to another index. This results in 609 MiB (656,047 keys) and 77 MiB (39 keys) indices respectively. Note that the biggest index is only a 6.5% overhead which is much better than with EWAH compressed bitmaps which almost double the size. A block-list or b-tree would be more inefficient still.
Overall, the process takes 42s, or around 3.56M rows per second insert rate with indexing. Without indexing, we can do this in 7.31s, which is about 20M rows per second. There's still some room for optimisations here but this is a nice start. Indexing performance seems pretty bad, but it makes sense because we're not using write-optimised data structures since we care more about query performance. Still, it might be a nice thing to add in future as a trade-off that the user can make - much faster inserts in return for slightly slower queries.
Querying out 94 records using the protocol-id index and format them in ascii takes 0.0025. Or 2.5 miliseconds. The new query evaluator might improve on this but probably most of the overhead is in printf'ing the results and init/cleanup. This takes about 3.8s without the index.
Of course, this number is not hugely interesting right now. But until support for more complex queries is added, then that's all we can really look at.
Overall, for a few weeks of part-time work, I'm quite happy with the results. Even if they are not mind-blowing. Certainly there are some interesting future directions still, especially in query planning and evaluation.
Our design makes durability quite easy to achieve. There is code for WAL operations that we might resurrect in future. Given our use cases, it's more likely that we will have a data-injestion API that provides special control over checkpointing and recovery.
Copyright (c) Spanish Inquisition 1478-1834. All rights reversed.