ezcdb::cdb.5

constant database multitool
cdb(5)                               ezcdb                              cdb(5)



NAME
       cdb - constant database file format

DESCRIPTION
       A  cdb  file is implemented as a partitioned hash table.  It provides a
       set of mappings for record ``keys'' to record  ``values'',  where  keys
       and records are any arbitrary strings of bytes.

       A cdb file is comprised of three sections:

           P R H

       Where:

       P[0..255]
              An  array  of  256  pointer elements providing a ``table of con-
              tents'' for the 256 hash subtables in section H.   Each  element
              P[ntab] is eight bytes: 4 bytes offset to start of hash subtable
              ntab; 4 bytes number of slots in hash subtable ntab.  A subtable
              may have zero slots.

       R[0..(nrecs - 1)]
              The  sequential  collection  of  nrecs  records in the database.
              Storage for each record R[nrec] is a concatenation of:  4  bytes
              key  length; 4 bytes value length; record key; and record value.

       H[0..255][0..(tslots - 1)]
              An array of 256 hash subtables.  Each subtable H[ntab]  is  com-
              prised  of  P[ntab].tslots  slots.   Each slot is eight bytes: 4
              bytes hash value; 4 bytes offset to associated record in section
              R.  An offset value of 0 indicates an empty slot, and no associ-
              ated record exists.

       Hash values are calculated for any given byte string key of length klen
       according to the following algorithm:

           uint32_t
           cdb_hash(unsigned char *key, uint32_t klen)
           {
             uint32_t  h = 5381;

             while(klen){
               h = h * 33;
               h = h ^ *key;
               ++key; --klen;
             }

             return h;
           }

       The subtable ntab for a given key is calculated from the relation:

           ntab = (hash % 256).

       The  subtable offset and number of subtable slots is read from P[ntab].
       The target slot for a given key in a subtable is computed from:

           target = ((hash >> 8) % tslots)

       where tslots is the number of  slots  in  the  subtable  as  read  from
       P[ntab].

       A record is located according to an open-addressing hash scheme as fol-
       lows.  First compute the hash value, subtable, and target slot for  the
       record  key.  Begin probing in subtable H[ntab], starting with the tar-
       get slot, then the next slot, and so forth, while checking at each step
       in section R for a matching key.  Continue until either the record with
       matching key is found, or until an empty slot is encountered.

       All offset, length, and hash values are stored as 32-bit unsigned inte-
       gers  packed  into  a  portable, little-endian form, 4 bytes per value.
       There is no particular constraint on sizes of keys or values, or number
       of  records  in  a  database,  except  for the range limits inherent in
       32-bit integers.  Consequently, a cdb file may not exceed 2^32-1  bytes
       (4 gigabytes).

NOTES
       The  cdb hash function is simple, fast, and readily factored into basic
       machine instructions.  When the function is considered on its own  mer-
       its,  outside  of the cdb context, the distribution results are usually
       neither spectacular nor tragic.  In particular --  and  common  to  all
       hash  functions of this type -- the result is strongly patterned by the
       last byte of the key.

       For example, consider the following keys and cdb hash  values  from  an
       airport codes database:

           ABJ  0x0b87b6ac
           ABK  0x0b87b6ad
           ABL  0x0b87b6aa
           ABM  0x0b87b6ab

       As can be seen here, the hash value for each of these keys di