cdb(5) ezcdb cdb(5)
NAME
cdb - constant database file format
DESCRIPTION
A cdb file is implemented as a partitioned hash table. It provides a
set of mappings for record ``keys'' to record ``values'', where keys
and records are any arbitrary strings of bytes.
A cdb file is comprised of three sections:
P R H
Where:
P[0..255]
An array of 256 pointer elements providing a ``table of con-
tents'' for the 256 hash subtables in section H. Each element
P[ntab] is eight bytes: 4 bytes offset to start of hash subtable
ntab; 4 bytes number of slots in hash subtable ntab. A subtable
may have zero slots.
R[0..(nrecs - 1)]
The sequential collection of nrecs records in the database.
Storage for each record R[nrec] is a concatenation of: 4 bytes
key length; 4 bytes value length; record key; and record value.
H[0..255][0..(tslots - 1)]
An array of 256 hash subtables. Each subtable H[ntab] is com-
prised of P[ntab].tslots slots. Each slot is eight bytes: 4
bytes hash value; 4 bytes offset to associated record in section
R. An offset value of 0 indicates an empty slot, and no associ-
ated record exists.
Hash values are calculated for any given byte string key of length klen
according to the following algorithm:
uint32_t
cdb_hash(unsigned char *key, uint32_t klen)
{
uint32_t h = 5381;
while(klen){
h = h * 33;
h = h ^ *key;
++key; --klen;
}
return h;
}
The subtable ntab for a given key is calculated from the relation:
ntab = (hash % 256).
The subtable offset and number of subtable slots is read from P[ntab].
The target slot for a given key in a subtable is computed from:
target = ((hash >> 8) % tslots)
where tslots is the number of slots in the subtable as read from
P[ntab].
A record is located according to an open-addressing hash scheme as fol-
lows. First compute the hash value, subtable, and target slot for the
record key. Begin probing in subtable H[ntab], starting with the tar-
get slot, then the next slot, and so forth, while checking at each step
in section R for a matching key. Continue until either the record with
matching key is found, or until an empty slot is encountered.
All offset, length, and hash values are stored as 32-bit unsigned inte-
gers packed into a portable, little-endian form, 4 bytes per value.
There is no particular constraint on sizes of keys or values, or number
of records in a database, except for the range limits inherent in
32-bit integers. Consequently, a cdb file may not exceed 2^32-1 bytes
(4 gigabytes).
NOTES
The cdb hash function is simple, fast, and readily factored into basic
machine instructions. When the function is considered on its own mer-
its, outside of the cdb context, the distribution results are usually
neither spectacular nor tragic. In particular -- and common to all
hash functions of this type -- the result is strongly patterned by the
last byte of the key.
For example, consider the following keys and cdb hash values from an
airport codes database:
ABJ 0x0b87b6ac
ABK 0x0b87b6ad
ABL 0x0b87b6aa
ABM 0x0b87b6ab
As can be seen here, the hash value for each of these keys di