hash database (hdb32) multitool
hdb(5)                                hdb                               hdb(5)



NAME
       hdb - hash database (hdb32) file format

DESCRIPTION
       An  hdb file is implemented as a partitioned hash table.  It provides a
       set of mappings for record ``keys'' to record  ``values'',  where  keys
       and records are any arbitrary strings of bytes.

       An  hdb  file  is implemented as a ``write-once, read-many'' datastore.
       It is generally used by applications  requiring  fast  read  access  to
       ``constant''  data,  and  is modeled after the cdb(5) constant database
       specification.

       A hdb file is comprised of six sections:

           I M P C R H

       Where:

       I: identifier
              A file type identifier of 16 bytes.  The  identifier  for  files
              conforming to this specification is ``hdb32/1.0\0\0\0\0\0\0\0''.

       M: metadata
              File metadata of 8 bytes consisting  of:  4  bytes  nrecs  total
              records  in database; and 4 bytes offset to start of record sec-
              tion R.

       P: hash subtable pointers
              An array of 8 pointer elements providing a ``table of contents''
              for  the 8 hash subtables in section H.  Each element P[ntab] is
              eight bytes: 4 bytes number of slots in hash  subtable  ntab;  4
              bytes  offset  to  start  of hash subtable ntab.  A subtable may
              have zero slots.

       C: descriptive comment
              An optional comment of any length, normally used to describe the
              dataset.

       R: records
              The  sequential  collection  of  nrecs  records in the database.
              Storage for each record R[nrec] is a concatenation of:  3  bytes
              key  length; 3 bytes value length; record key; and record value.

       H: hash tables
              An array of 8 hash subtables.  Each  subtable  H[ntab]  is  com-
              prised  of  P[ntab].tslots  slots.   Each slot is eight bytes: 4
              bytes hash value; 4 bytes offset to associated record in section
              R.  An offset value of 0 indicates an empty slot, and no associ-
              ated record exists.

       Hash values are calculated for any given byte string key of length klen
       according to the following algorithm:

           uint32_t
           hdb32_hash(unsigned char *key, uint32_t klen)
           {
             uint32_t  h = 0;

             while(klen){
               h = h ^ *key;
               h = h * 37;
               ++key; --klen;
             }

             return h;
           }

       The subtable ntab for a given key is calculated from the relation:

           ntab = (hash % 8).

       The  subtable offset and number of subtable slots is read from P[ntab].
       The target slot for a given key in a subtable is computed from:

           target = ((HMIX(hash) >> 3) % tslots)

       where tslots is the number of  slots  in  the  subtable  as  read  from
       P[ntab], and HMIX(hash) is a bit operation defined as:

           (hash >> 13) ^ hash

       A record is located according to an open-addressing hash scheme as fol-
       lows.  First compute the hash value, subtable, and target slot for  the
       record  key.  Begin probing in subtable H[ntab], starting with the tar-
       get slot, then the next slot, and so forth, while checking at each step
       in section R for a matching key.  Continue until either the record with
       matching key is found, or until an empty slot is encountered.

       All offset and hash values  are  stored  as  32-bit  unsigned  integers
       packed  into a portable, little-endian form, 4 bytes per value.  Length
       values for keys and values are  stored  as  24-bit  unsigned  integers,
       packed into a portable, little-endian form, 3 bytes per value.  A hdb32
       file may not exceed (2^32 - 1) bytes (4 gigabytes).  Lengths  of  indi-
       vidual  keys and values may not exceed (2^24 - 1) bytes (16 megabytes).

NOTES
       This specification is qualified as the ``hdb32''  version  of  the  hdb
       file  format.   It defines the standard hdb specification for a maximum
       file size based on 32-bit unsigned integers.  As the general hdb format
       may be adjusted to suit other integer types in the future, those speci-
       fications should be qualified accordingly.

       The hdb32 hash function is simple, reasonably fast,  and  readily  fac-
       tored  into  basic  machine instructions.  Otherwise it is pathetically
       unremarkable, particularly given the man-months wasted on  testing  and
       devising numerous other alternatives and parameters.

       The  function itself is mostly just the standard textbook hash, select-
       ing its mix multiplier (37) from among the usual suspects (31, 33, 37).
       It  differs  only  slightly from the K&R form by inverting the order of
       mix and combine operations.  Even with this reordering, the hash result
       is still strongly patterned by the last byte of the key.

       For  example, consider the following keys and hdb32 hash values from an
       airport codes database:

           ABJ  0x0030fbad
           ABK  0x0030fb88
           ABL  0x0030fc8b
           ABM  0x0030fc66

       As can be seen here, the hash value for each of these keys differs gen-
       erally  in  only  the final byte, and according to the last byte of the
       key.  No  avalanche  is  at  play,  no  randomness  whatsoever  in  the
       sequenced progression of bit patterns.

       Yet  despite  a lack of aesthetic and theoretical appeal, this function
       does appear to provide adequate performance within the  hdb32  context.
       The  3 bits used for subtable partitioning may provide for some disper-
       sion of aggregate clusters, and the postmix applied before slot  selec-
       tion distributes some of the better-mixed bits in the higher bytes down
       into the lower bits.  Empirical observations  suggest  that  the  hdb32
       hash/subtable  strategy  provides  for  consistent statistical behavior
       over widely varying types of record input.

       It should not be supposed that the use of multiple subtables  generally
       and universally improves hash distribution/performance when compared to
       use of a single table.  (Nor has my own testing  revealed  some  magic,
       prime  number  of subtables that would consistently outperform either a
       single table or power-of-2 multiple.)  In fact,  given  the  same  hash
       function,  a single hash table will generally have equivalent or better
       statistical properties on the same dataset when  compared  to  multiple
       tables.

       The  primary  benefit  of using multiple subtables is actually found in
       the conservation of memory during hdb file generation.  By using 8 sub-
       tables, the hdb-make(1) operation uses a much smaller memory allocation
       than would otherwise be required for a single large table.

       The load factor of each hash subtable is not defined by the  specifica-
       tion.   The  reference implementation uses a load factor of 0.50.  That
       is, each subtable is set with twice as many slots as  elements  in  the
       subtable, and half the slots will be empty.

       By  convention,  hdb32 files are named with an ``.hdb'' extension.  (It
       is suggested that an ``.hdbx'' extension be used to differentiate files
       following some future hdb64 specification.)

       The  hdb  file format specification was first published with the hdb(1)
       reference implementation in December 2010.

AUTHOR
       Wayne Marshall, http://b0llix.net/hdb/

SEE ALSO
       hdb(1).



hdb-0.03                         January 2013                           hdb(5)