==============================================================================
SKUNK - A fast static hash database library
Copyright (c) 2001-2002 Gianni Tedesco <gianni@ecsc.co.uk>
This software is released under the terms of the GNU GPL version 2 or later
==============================================================================

Skunk creates static databases (i.e. databases that are not designed to be
modified after creation). It supports databases up to 4GB in size. The main
aims are blistering speed and a very small footprint (so you can just link
it as a .o file if need be).

An LGPL version of the skunk library will be released when the format has
stabilised. Skunk is meant to be used as a library inside other applications;
the tools are really just demos.

Maybe one day I will extend it into something that you can insert into and
use purely 'in memory'. Berkeley DB is probably best for that kind of thing
though.

Making a database
=================
 sdb_make takes a plain text file as input. The text file is formatted with ':'
 as the field delimiter and LF ('\n') as the record delimiter. There can be only
 two fields per record: field 1 is the key and field 2 is the value.
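
 As an illustration of the input format described above, here is a minimal
 Python sketch that renders key/value pairs as source text. The function name
 format_records is hypothetical and is not part of skunk. It assumes there is
 no escaping mechanism, so it conservatively rejects ':' and LF inside either
 field; the real sdb_make may be more permissive.

```python
def format_records(pairs):
    """Render (key, value) pairs as 'key:value' lines, one record per line."""
    lines = []
    for key, value in pairs:
        for field in (key, value):
            # Assumption: no escaping exists, so the delimiters themselves
            # cannot appear inside a field.
            if ":" in field or "\n" in field:
                raise ValueError(f"field contains a delimiter: {field!r}")
        lines.append(f"{key}:{value}\n")
    return "".join(lines)


if __name__ == "__main__":
    # Write a tiny source file that could then be handed to sdb_make.
    with open("example.src", "w", newline="\n") as out:
        out.write(format_records([("tcp", "6"), ("udp", "17")]))
```

 The resulting file has one record per line, which is the shape the paragraph
 above describes.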

 There are some tools for building database source files in the scripts
 directory. sv.py creates database source files from /etc/protocols or
 /etc/services. mkmillion creates a million record database.


Querying a database
===================
 sdb_query queries for a key in a DB file and returns the result. sdb_dump
 dumps ALL the data in the DB (in a non-reusable format currently).


Why is skunk DB so fast?
========================

 sdb_make:
  No data is ever copied; the input file is mmapped
  The process does very little dynamic allocation
  Output is buffered in 4KB blocks
  Internal lists are lists of arrays - cuts down on dynamic allocation
  Internal lists have the last element cached for fast appends
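
 The "no data is ever copied" point can be sketched in Python as follows. This
 is not skunk's code (which is C); iter_record_spans is a hypothetical helper
 that maps the input file and reports each record as byte offsets into the
 mapping rather than as copied strings. Skipping lines that have no ':' is an
 assumption made for the sketch.

```python
import mmap


def iter_record_spans(path):
    """Yield (key_start, key_end, val_start, val_end) offsets per record.

    The file is mapped read-only and scanned in place; only integer offsets
    are produced. Note that mmap cannot map an empty file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            pos, size = 0, len(m)
            while pos < size:
                eol = m.find(b"\n", pos)
                if eol == -1:
                    eol = size  # last record has no trailing LF
                sep = m.find(b":", pos, eol)
                if sep != -1:
                    yield pos, sep, sep + 1, eol
                pos = eol + 1
```

 A consumer can hash or write out m[key_start:key_end] directly from the
 mapping when it needs the bytes, so the parser itself never builds
 intermediate copies.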

 sdb_query:
  Database is mmapped
  Lookups use no system calls
  Fowler/Noll/Vo hash, very fast
  Database is structured to take advantage of readahead
  All arithmetic is 32bit integer math
  No dynamic allocation is needed
  A successful lookup requires one 8 byte and one 16 byte read from
    the file (usually on two sequential blocks/pages)
  All metadata is stored in the hash table (a surprising win)
  Hash tables are given breathing space to prevent chaining
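
 For reference, here is a short Python sketch of a 32-bit Fowler/Noll/Vo hash.
 The text does not say which FNV variant skunk uses, so FNV-1a is assumed here
 purely for illustration. slot_for and its table_slots parameter are
 hypothetical and do not reflect skunk's actual table layout.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV32_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Return the 32-bit FNV-1a hash of data."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        # Mask to emulate 32-bit unsigned overflow, as C arithmetic would.
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


def slot_for(key: bytes, table_slots: int) -> int:
    """Map a key to a slot index (hypothetical; assumes a flat table)."""
    return fnv1a_32(key) % table_slots
```

 The hash needs only one XOR and one 32-bit multiply per input byte, which
 fits the "all arithmetic is 32bit integer math" point above. Making
 table_slots larger than the number of records is one way to give a table
 breathing space.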

 sdb_dump:
  It isn't amazingly fast at all; it's only there to help me
  debug databases. That said, sdb_dump > /dev/null on a million
  record database takes only 5 seconds.


Speed Examples
==============

Preliminary benchmarks on my laptop.
Coppermine 700MHz / 256MB RAM, single ATA33 hard disk.
Software is Linux 2.4.9 using reiserfs 3.5.x filesystems.

 * On a million record database, all records can be successfully queried in
   around 2.5 seconds. Profiling shows that around 37% of the time is spent
   creating the key and hashing it.

 * Around 7 seconds to create a million record DB, reading data from the
   disk at the same time as writing it. 23% of the time is spent parsing
   input, and about 50% is due to the collision resolution calculations.
   Collision resolution absolutely murders performance here - creation took
   about 2.8 seconds before I added it.

 * Dumping the database to /dev/null takes 3.7 seconds.
