==============================================================================
SKUNK - A fast static hash database library
Copyright (c) 2001-2002 Gianni Tedesco <gianni@ecsc.co.uk>
This software is released under the terms of the GNU GPL version 2 or later
==============================================================================

Skunk creates static databases (ie: not designed to be modified after
creation). It supports databases up to 4GB in size. It is extremely fast.
The main aims are blistering high speed and very small footprint so you
can statically link it if you want.

An LGPL version of the skunk library will be released when the format has
stabilised. The usage of skunk is more as a library inside other applications,
the tools are just demos really...

Maybe one day I will extend it in to something that you can insert into and
delete from. The ability to run purely 'in memory' would be nice too. Probably
berkeley db is best for that kind of thing though.


Making a database
=================
 sdb_make takes a plain text file as input. The text file is formatted with ':'
 as a field delimiter and LF ('\n') as record delimiter. There can be only two
 fields per record, field 1 is the key and field 2 is the value.

 There are some tools for building database source files in the scripts
 directory. sv.py creates database source files from /etc/protocols or
 /etc/services. mkmillion creates a million record database.


Querying a database
===================
 sdb_query queries for a key in a DB file and returns the result. sdb_dump
 dumps ALL the data in the DB in a reusable format.


Why is skunk DB so fast?
========================

 sdb_make:
  No data is ever copied, the input file is mmapped
  The process does very little dynamic allocation
  Output is buffered in 8KB blocks
  Internal lists have the last element cached for O(1) appends
  For reliability we fsync() the database, this slows us right down
    but there is no way to avoid this. Any DB which doesn't fsync the
    file cannot guarantee your data is written to disk!

 sdb_query:
  Database is mmapped
  Lookups use no system calls
  Fowler/Noll/Vo hash, very fast
  Database is structured to take advantage of readahead
  All arithmetic is 32bit integer math
  No dynamic allocation is needed
  Successful lookups require an 8 byte and a 16 byte read on to
    the file (usually on two sequential blocks/pages)
  All metadata is stored in the hash table (a suprising win)
  Hash tables are given breathing space to prevent chaining

 sdb_dump:
  It isn't amazingly fast at all, its only there to help me
  debug databases... That said sdb_dump > /dev/null on a million
  record database takes only 1 second.

Speed Examples
==============

Preliminary benchmarks on my laptop.
 CPU: Coppermine 700MHz (256k cache)
 RAM: 256Mb RAM
 HD: IBM-DJSA-210 ATA33 laptop harddisk
 OS: Linux 2.4.9.
 FS: reiserfs 3.5.x (>90% full)

 * On a million record database, all records can be successfully queried in
   around 2 seconds. Profiling shows that around 37% of the time is spent
   creating the key and hashing it. Benchmarking cdb on the same data took
   3 seconds.

 * around 3.7 seconds to create a million record DB, reading data from the
   disk at the same time as writing it. About 25% of the time is spent parsing
   the input. Collision resolution absolutely murders performance here.

 * Dumping the database to /dev/null takes 1 second.
