hdb(5) hdb hdb(5) NAME hdb - hash database (hdb32) file format DESCRIPTION An hdb file is implemented as a partitioned hash table. It provides a set of mappings for record ``keys'' to record ``values'', where keys and records are any arbitrary strings of bytes. An hdb file is implemented as a ``write-once, read-many'' datastore. It is generally used by applications requiring fast read access to ``constant'' data, and is modeled after the cdb(5) constant database specification. A hdb file is comprised of six sections: I M P C R H Where: I: identifier A file type identifier of 16 bytes. The identifier for files conforming to this specification is ``hdb32/1.0\0\0\0\0\0\0\0''. M: metadata File metadata of 8 bytes consisting of: 4 bytes nrecs total records in database; and 4 bytes offset to start of record sec- tion R. P: hash subtable pointers An array of 8 pointer elements providing a ``table of contents'' for the 8 hash subtables in section H. Each element P[ntab] is eight bytes: 4 bytes number of slots in hash subtable ntab; 4 bytes offset to start of hash subtable ntab. A subtable may have zero slots. C: descriptive comment An optional comment of any length, normally used to describe the dataset. R: records The sequential collection of nrecs records in the database. Storage for each record R[nrec] is a concatenation of: 3 bytes key length; 3 bytes value length; record key; and record value. H: hash tables An array of 8 hash subtables. Each subtable H[ntab] is com- prised of P[ntab].tslots slots. Each slot is eight bytes: 4 bytes hash value; 4 bytes offset to associated record in section R. An offset value of 0 indicates an empty slot, and no associ- ated record exists. Hash values are calculated for any given byte string key of length klen according to the following algorithm: uint32_t hdb32_hash(unsigned char *key, uint32_t klen) { uint32_t h = 0; while(klen){ h = h ^ *key; h = h * 37; ++key; --klen; } return h; } The subtable ntab for a given key is calculated from the relation: ntab = (hash % 8). The subtable offset and number of subtable slots is read from P[ntab]. The target slot for a given key in a subtable is computed from: target = ((HMIX(hash) >> 3) % tslots) where tslots is the number of slots in the subtable as read from P[ntab], and HMIX(hash) is a bit operation defined as: (hash >> 13) ^ hash A record is located according to an open-addressing hash scheme as fol- lows. First compute the hash value, subtable, and target slot for the record key. Begin probing in subtable H[ntab], starting with the tar- get slot, then the next slot, and so forth, while checking at each step in section R for a matching key. Continue until either the record with matching key is found, or until an empty slot is encountered. All offset and hash values are stored as 32-bit unsigned integers packed into a portable, little-endian form, 4 bytes per value. Length values for keys and values are stored as 24-bit unsigned integers, packed into a portable, little-endian form, 3 bytes per value. A hdb32 file may not exceed (2^32 - 1) bytes (4 gigabytes). Lengths of indi- vidual keys and values may not exceed (2^24 - 1) bytes (16 megabytes). NOTES This specification is qualified as the ``hdb32'' version of the hdb file format. It defines the standard hdb specification for a maximum file size based on 32-bit unsigned integers. As the general hdb format may be adjusted to suit other integer types in the future, those speci- fications should be qualified accordingly. The hdb32 hash function is simple, reasonably fast, and readily fac- tored into basic machine instructions. Otherwise it is pathetically unremarkable, particularly given the man-months wasted on testing and devising numerous other alternatives and parameters. The function itself is mostly just the standard textbook hash, select- ing its mix multiplier (37) from among the usual suspects (31, 33, 37). It differs only slightly from the K&R form by inverting the order of mix and combine operations. Even with this reordering, the hash result is still strongly patterned by the last byte of the key. For example, consider the following keys and hdb32 hash values from an airport codes database: ABJ 0x0030fbad ABK 0x0030fb88 ABL 0x0030fc8b ABM 0x0030fc66 As can be seen here, the hash value for each of these keys differs gen- erally in only the final byte, and according to the last byte of the key. No avalanche is at play, no randomness whatsoever in the sequenced progression of bit patterns. Yet despite a lack of aesthetic and theoretical appeal, this function does appear to provide adequate performance within the hdb32 context. The 3 bits used for subtable partitioning may provide for some disper- sion of aggregate clusters, and the postmix applied before slot selec- tion distributes some of the better-mixed bits in the higher bytes down into the lower bits. Empirical observations suggest that the hdb32 hash/subtable strategy provides for consistent statistical behavior over widely varying types of record input. It should not be supposed that the use of multiple subtables generally and universally improves hash distribution/performance when compared to use of a single table. (Nor has my own testing revealed some magic, prime number of subtables that would consistently outperform either a single table or power-of-2 multiple.) In fact, given the same hash function, a single hash table will generally have equivalent or better statistical properties on the same dataset when compared to multiple tables. The primary benefit of using multiple subtables is actually found in the conservation of memory during hdb file generation. By using 8 sub- tables, the hdb-make(1) operation uses a much smaller memory allocation than would otherwise be required for a single large table. The load factor of each hash subtable is not defined by the specifica- tion. The reference implementation uses a load factor of 0.50. That is, each subtable is set with twice as many slots as elements in the subtable, and half the slots will be empty. By convention, hdb32 files are named with an ``.hdb'' extension. (It is suggested that an ``.hdbx'' extension be used to differentiate files following some future hdb64 specification.) The hdb file format specification was first published with the hdb(1) reference implementation in December 2010. AUTHOR Wayne Marshall, http://b0llix.net/hdb/ SEE ALSO hdb(1). hdb-0.03 January 2013 hdb(5)