cdb(5) ezcdb cdb(5) NAME cdb - constant database file format DESCRIPTION A cdb file is implemented as a partitioned hash table. It provides a set of mappings for record ``keys'' to record ``values'', where keys and records are any arbitrary strings of bytes. A cdb file is comprised of three sections: P R H Where: P[0..255] An array of 256 pointer elements providing a ``table of con- tents'' for the 256 hash subtables in section H. Each element P[ntab] is eight bytes: 4 bytes offset to start of hash subtable ntab; 4 bytes number of slots in hash subtable ntab. A subtable may have zero slots. R[0..(nrecs - 1)] The sequential collection of nrecs records in the database. Storage for each record R[nrec] is a concatenation of: 4 bytes key length; 4 bytes value length; record key; and record value. H[0..255][0..(tslots - 1)] An array of 256 hash subtables. Each subtable H[ntab] is com- prised of P[ntab].tslots slots. Each slot is eight bytes: 4 bytes hash value; 4 bytes offset to associated record in section R. An offset value of 0 indicates an empty slot, and no associ- ated record exists. Hash values are calculated for any given byte string key of length klen according to the following algorithm: uint32_t cdb_hash(unsigned char *key, uint32_t klen) { uint32_t h = 5381; while(klen){ h = h * 33; h = h ^ *key; ++key; --klen; } return h; } The subtable ntab for a given key is calculated from the relation: ntab = (hash % 256). The subtable offset and number of subtable slots is read from P[ntab]. The target slot for a given key in a subtable is computed from: target = ((hash >> 8) % tslots) where tslots is the number of slots in the subtable as read from P[ntab]. A record is located according to an open-addressing hash scheme as fol- lows. First compute the hash value, subtable, and target slot for the record key. Begin probing in subtable H[ntab], starting with the tar- get slot, then the next slot, and so forth, while checking at each step in section R for a matching key. Continue until either the record with matching key is found, or until an empty slot is encountered. All offset, length, and hash values are stored as 32-bit unsigned inte- gers packed into a portable, little-endian form, 4 bytes per value. There is no particular constraint on sizes of keys or values, or number of records in a database, except for the range limits inherent in 32-bit integers. Consequently, a cdb file may not exceed 2^32-1 bytes (4 gigabytes). NOTES The cdb hash function is simple, fast, and readily factored into basic machine instructions. When the function is considered on its own mer- its, outside of the cdb context, the distribution results are usually neither spectacular nor tragic. In particular -- and common to all hash functions of this type -- the result is strongly patterned by the last byte of the key. For example, consider the following keys and cdb hash values from an airport codes database: ABJ 0x0b87b6ac ABK 0x0b87b6ad ABL 0x0b87b6aa ABM 0x0b87b6ab As can be seen here, the hash value for each of these keys differs only in the final byte, and according to the last byte of the key. Only the single tiny final bit of the 32-bit hash separates ABL from ABM, even though one is in Alaska, and the other is in Australia. Within the cdb context, this apparent hash deficiency is turned to some advantage. The last byte of the cdb hash is used only for partitioning keys into 256 subtables. These lower 8 bits are then discarded before proceeding to compute the target slot. The result is that slot selec- tion will be using the higher bits of the hash value, which have been better mixed, and will generally have more apparently random distribu- tion properties. Also, keys sharing a common prefix (like those with ``AB'' above) will be segregated into separate subtables, thus reducing the collision probabilities among them. Nevertheless, it should not be supposed that a scheme of multiple sub- tables generally and universally improves hash distribution/performance when compared to using a single table. Even when using the cdb hash function, a single hash table will usually have equivalent or better statistical properties on the same dataset. In fact, the main benefit of using multiple subtables seems to reside primarily in the memory required for cdb file generation. Here the scheme of multible subta- bles permits the use of a smaller memory allocation than would other- wise be required for a single large table. The load factor of each hash subtable is not defined by the specifica- tion. Implementations commonly use a load factor of 0.50. That is, each subtable is set with twice as many slots as elements in the sub- table, and half the slots will be empty. By convention, cdb files are named with a ``.cdb'' extension. There is otherwise no explicit header or ``magic'' within a cdb file by which it may be identified. The cdb file format specification was originally designed by Daniel J. Bernstein and published in 1996. AUTHOR Wayne Marshall, http://b0llix.net/ezcdb/ SEE ALSO ezcdb(1), http://cr.yp.to/cdb.html ezcdb-0.01 December 2010 cdb(5)