constant database multitool
cdb(5)                               ezcdb                              cdb(5)

       cdb - constant database file format

       A  cdb  file is implemented as a partitioned hash table.  It provides a
       set of mappings for record ``keys'' to record  ``values'',  where  keys
       and records are any arbitrary strings of bytes.

       A cdb file is comprised of three sections:

           P R H


              An  array  of  256  pointer elements providing a ``table of con-
              tents'' for the 256 hash subtables in section H.   Each  element
              P[ntab] is eight bytes: 4 bytes offset to start of hash subtable
              ntab; 4 bytes number of slots in hash subtable ntab.  A subtable
              may have zero slots.

       R[0..(nrecs - 1)]
              The  sequential  collection  of  nrecs  records in the database.
              Storage for each record R[nrec] is a concatenation of:  4  bytes
              key  length; 4 bytes value length; record key; and record value.

       H[0..255][0..(tslots - 1)]
              An array of 256 hash subtables.  Each subtable H[ntab]  is  com-
              prised  of  P[ntab].tslots  slots.   Each slot is eight bytes: 4
              bytes hash value; 4 bytes offset to associated record in section
              R.  An offset value of 0 indicates an empty slot, and no associ-
              ated record exists.

       Hash values are calculated for any given byte string key of length klen
       according to the following algorithm:

           cdb_hash(unsigned char *key, uint32_t klen)
             uint32_t  h = 5381;

               h = h * 33;
               h = h ^ *key;
               ++key; --klen;

             return h;

       The subtable ntab for a given key is calculated from the relation:

           ntab = (hash % 256).

       The  subtable offset and number of subtable slots is read from P[ntab].
       The target slot for a given key in a subtable is computed from:

           target = ((hash >> 8) % tslots)

       where tslots is the number of  slots  in  the  subtable  as  read  from

       A record is located according to an open-addressing hash scheme as fol-
       lows.  First compute the hash value, subtable, and target slot for  the
       record  key.  Begin probing in subtable H[ntab], starting with the tar-
       get slot, then the next slot, and so forth, while checking at each step
       in section R for a matching key.  Continue until either the record with
       matching key is found, or until an empty slot is encountered.

       All offset, length, and hash values are stored as 32-bit unsigned inte-
       gers  packed  into  a  portable, little-endian form, 4 bytes per value.
       There is no particular constraint on sizes of keys or values, or number
       of  records  in  a  database,  except  for the range limits inherent in
       32-bit integers.  Consequently, a cdb file may not exceed 2^32-1  bytes
       (4 gigabytes).

       The  cdb hash function is simple, fast, and readily factored into basic
       machine instructions.  When the function is considered on its own  mer-
       its,  outside  of the cdb context, the distribution results are usually
       neither spectacular nor tragic.  In particular --  and  common  to  all
       hash  functions of this type -- the result is strongly patterned by the
       last byte of the key.

       For example, consider the following keys and cdb hash  values  from  an
       airport codes database:

           ABJ  0x0b87b6ac
           ABK  0x0b87b6ad
           ABL  0x0b87b6aa
           ABM  0x0b87b6ab

       As can be seen here, the hash value for each of these keys differs only
       in the final byte, and according to the last byte of the key.  Only the
       single  tiny  final bit of the 32-bit hash separates ABL from ABM, even
       though one is in Alaska, and the other is in Australia.

       Within the cdb context, this apparent hash deficiency is turned to some
       advantage.  The last byte of the cdb hash is used only for partitioning
       keys into 256 subtables.  These lower 8 bits are then discarded  before
       proceeding  to compute the target slot.  The result is that slot selec-
       tion will be using the higher bits of the hash value, which  have  been
       better  mixed, and will generally have more apparently random distribu-
       tion properties.  Also, keys sharing a common prefix (like  those  with
       ``AB'' above) will be segregated into separate subtables, thus reducing
       the collision probabilities among them.

       Nevertheless, it should not be supposed that a scheme of multiple  sub-
       tables generally and universally improves hash distribution/performance
       when compared to using a single table.  Even when using  the  cdb  hash
       function,  a  single  hash table will usually have equivalent or better
       statistical properties on the same dataset.  In fact, the main  benefit
       of  using  multiple  subtables  seems to reside primarily in the memory
       required for cdb file generation.  Here the scheme of  multible  subta-
       bles  permits  the use of a smaller memory allocation than would other-
       wise be required for a single large table.

       The load factor of each hash subtable is not defined by the  specifica-
       tion.   Implementations  commonly  use a load factor of 0.50.  That is,
       each subtable is set with twice as many slots as elements in  the  sub-
       table, and half the slots will be empty.

       By convention, cdb files are named with a ``.cdb'' extension.  There is
       otherwise no explicit header or ``magic'' within a cdb file by which it
       may be identified.

       The  cdb file format specification was originally designed by Daniel J.
       Bernstein and published in 1996.

       Wayne Marshall,


ezcdb-0.01                       December 2010                          cdb(5)