==============================================================================
SKUNK - A fast static hash database library
Copyright (c) 2001-2002 Gianni Tedesco <gianni@ecsc.co.uk>
This software is released under the terms of the GNU GPL version 2 or later
==============================================================================

Skunk creates static databases (ie: not designed to be modified after
creation). It supports databases up to 4GB in size. It is extremely fast.
The main aims are blistering high speed and very small footprint so you
can statically link it if you want.

Maybe one day I will extend it in to something that you can insert into and
delete from. The ability to run purely 'in memory' would be nice too. Probably
berkeley db is best for that kind of thing though.


Making a database
=================
 sdb_make takes a plain text file as input. The text file is formatted with ':'
 as a field delimiter and LF ('\n') as record delimiter. There can be only two
 fields per record, field 1 is the key and field 2 is the value.

 There are some tools for building database source files in the scripts
 directory. sv.py creates database source files from /etc/protocols or
 /etc/services. mkmillion creates a million record database.


Querying a database
===================
 sdb_query queries for a key in a DB file and returns the result. sdb_dump
 dumps ALL the data in the DB in a reusable format.


Why is skunk DB so fast?
========================

 sdb_make:
  No data is ever copied, the input file is mmapped
  The process does very little dynamic allocation
  Output is buffered in 8KB blocks
  Internal lists have the last element cached for O(1) appends
  For reliability we fsync() the database, this slows us right down
    but there is no way to avoid this. Any DB which doesn't fsync the
    file cannot guarantee your data is written to disk!

 sdb_query:
  Database is mmapped
  Lookups use no system calls
  Fowler/Noll/Vo hash, very fast
  Database structure has good locality of reference
  All arithmetic is 32bit integer math
  No dynamic allocation is needed
  Successful lookups require an 8 byte and a 16 byte read on to
    the file (usually on two sequential blocks/pages)
  All metadata is stored in the hash table (a suprising win)
  Hash tables are never created with a load factor greater than 66%
    to prevent chaining
  Hash tables are never powers of two in size

 sdb_dump:
  It isn't amazingly fast at all, its only there to help me
  debug databases... That said sdb_dump > /dev/null on a million
  record database takes only 1 second.

Speed Examples
==============

Preliminary benchmarks on my laptop.
 CPU: powerpc 7455 667MHz
 RAM: 512Mb RAM
 HD: IBM-IC25N040ATCS04-0 ATA33 laptop harddisk
 OS: Linux 2.4.19.
 FS: reiserfs 3.6.25

 * On a million record database, all records can be successfully queried in
   around 0.425 seconds. Thats around 2.35 million queries a second. Thats
   half a microsecond per query.

 * around 3.7 seconds to create a million record DB, reading data from the
   disk at the same time as writing it. About 25% of the time is spent parsing
   the input. Collision resolution absolutely murders performance here.

 * Dumping the database to /dev/null takes 1 second.
