cdb(5) ezcdb cdb(5)
NAME
cdb - constant database file format
DESCRIPTION
A cdb file is implemented as a partitioned hash table. It provides a
set of mappings for record ``keys'' to record ``values'', where keys
and records are any arbitrary strings of bytes.
A cdb file is comprised of three sections:
P R H
Where:
P[0..255]
An array of 256 pointer elements providing a ``table of con-
tents'' for the 256 hash subtables in section H. Each element
P[ntab] is eight bytes: 4 bytes offset to start of hash subtable
ntab; 4 bytes number of slots in hash subtable ntab. A subtable
may have zero slots.
R[0..(nrecs - 1)]
The sequential collection of nrecs records in the database.
Storage for each record R[nrec] is a concatenation of: 4 bytes
key length; 4 bytes value length; record key; and record value.
H[0..255][0..(tslots - 1)]
An array of 256 hash subtables. Each subtable H[ntab] is com-
prised of P[ntab].tslots slots. Each slot is eight bytes: 4
bytes hash value; 4 bytes offset to associated record in section
R. An offset value of 0 indicates an empty slot, and no associ-
ated record exists.
Hash values are calculated for any given byte string key of length klen
according to the following algorithm:
uint32_t
cdb_hash(unsigned char *key, uint32_t klen)
{
uint32_t h = 5381;
while(klen){
h = h * 33;
h = h ^ *key;
++key; --klen;
}
return h;
}
The subtable ntab for a given key is calculated from the relation:
ntab = (hash % 256).
The subtable offset and number of subtable slots is read from P[ntab].
The target slot for a given key in a subtable is computed from:
target = ((hash >> 8) % tslots)
where tslots is the number of slots in the subtable as read from
P[ntab].
A record is located according to an open-addressing hash scheme as fol-
lows. First compute the hash value, subtable, and target slot for the
record key. Begin probing in subtable H[ntab], starting with the tar-
get slot, then the next slot, and so forth, while checking at each step
in section R for a matching key. Continue until either the record with
matching key is found, or until an empty slot is encountered.
All offset, length, and hash values are stored as 32-bit unsigned inte-
gers packed into a portable, little-endian form, 4 bytes per value.
There is no particular constraint on sizes of keys or values, or number
of records in a database, except for the range limits inherent in
32-bit integers. Consequently, a cdb file may not exceed 2^32-1 bytes
(4 gigabytes).
NOTES
The cdb hash function is simple, fast, and readily factored into basic
machine instructions. When the function is considered on its own mer-
its, outside of the cdb context, the distribution results are usually
neither spectacular nor tragic. In particular -- and common to all
hash functions of this type -- the result is strongly patterned by the
last byte of the key.
For example, consider the following keys and cdb hash values from an
airport codes database:
ABJ 0x0b87b6ac
ABK 0x0b87b6ad
ABL 0x0b87b6aa
ABM 0x0b87b6ab
As can be seen here, the hash value for each of these keys differs only
in the final byte, and according to the last byte of the key. Only the
single tiny final bit of the 32-bit hash separates ABL from ABM, even
though one is in Alaska, and the other is in Australia.
Within the cdb context, this apparent hash deficiency is turned to some
advantage. The last byte of the cdb hash is used only for partitioning
keys into 256 subtables. These lower 8 bits are then discarded before
proceeding to compute the target slot. The result is that slot selec-
tion will be using the higher bits of the hash value, which have been
better mixed, and will generally have more apparently random distribu-
tion properties. Also, keys sharing a common prefix (like those with
``AB'' above) will be segregated into separate subtables, thus reducing
the collision probabilities among them.
Nevertheless, it should not be supposed that a scheme of multiple sub-
tables generally and universally improves hash distribution/performance
when compared to using a single table. Even when using the cdb hash
function, a single hash table will usually have equivalent or better
statistical properties on the same dataset. In fact, the main benefit
of using multiple subtables seems to reside primarily in the memory
required for cdb file generation. Here the scheme of multible subta-
bles permits the use of a smaller memory allocation than would other-
wise be required for a single large table.
The load factor of each hash subtable is not defined by the specifica-
tion. Implementations commonly use a load factor of 0.50. That is,
each subtable is set with twice as many slots as elements in the sub-
table, and half the slots will be empty.
By convention, cdb files are named with a ``.cdb'' extension. There is
otherwise no explicit header or ``magic'' within a cdb file by which it
may be identified.
The cdb file format specification was originally designed by Daniel J.
Bernstein and published in 1996.
AUTHOR
Wayne Marshall, http://b0llix.net/ezcdb/
SEE ALSO
ezcdb(1), http://cr.yp.to/cdb.html
ezcdb-0.01 December 2010 cdb(5)