blob: 19091cb273d5a354324047ba6d1cc6801d776b27 [file] [log] [blame]
gpikea3716452014-03-31 22:10:41 +00001UNDERSTANDING HASH FUNCTIONS
2by Geoff Pike
3
4Version 0.1 --- early draft --- comments and questions welcome!
5References appear in square brackets.
6
71 INTRODUCTION
8
9Hashing has proven tremendously useful in constructing various fast
10data structures and algorithms. It is typically possible to simplify
11the analysis of hash-based algorithms if one assumes that the relevant
12hash functions are high quality. At the other extreme, if the
13relevant hash functions were always to return the same value, many
14hash-based algorithms become algorithms that are slower, simpler, but still well-known.
15For example, a chaining hash table devolves into a linked list.
16
17There are many possible definitions of hash function quality. For
18example, one might want a list of keys and their hashes to provide no
19pattern that would allow an opponent to predict anything about the
20hashes of other keys. Although I cannot prove it, I think I can meet
21this and many other definitions of quality with
22
23 f(s) = SHA-3(concatenation of z and s),
24
25where z is some secret string known only to me. This well-known trick
26provides, I think, more high-quality hash functions than anyone will
27need, though greater computational power in the future may push us to
28replace SHA-3 from time to time.
29
30In short, discussions about choosing a hash function are almost always
31discussions about speed, energy consumption, or similar. Concerns
32about hash quality are easy to fix, for a price.
33
342 ANATOMY OF A HASH FUNCTION
35
36Hash functions that input strings of arbitrary length are written in
37terms of an internal state, S. In many cases the internal state is a
38fixed number of bits and will fit in machine registers. One generic
39sketch of a string hash is:
40
41 let S = some initial value
42 let c = the length of S in bits
43 while (input is not exhausted) {
44 let t = the next c bits of input (padded with zeroes if less than c remain)
45 S = M(S xor t)
46 }
47 return F(S)
48
49where M is a hash function that inputs and outputs c bits, and F is a
50hash function that inputs c bits and outputs however many bits one
51needs to return. In some sense we have reduced the string-hashing
52problem to two integer hashing problems.
53
542.1 INTEGER HASHING TECHNIQUES
55
56A hash function that inputs and outputs the same number of bits, say,
5732, can use reversible bit-twiddling operations, each of which is
58"onto" in the mathematical sense. For example, multiplication by an
59odd constant is reversible, as all odd numbers are relatively prime to
602^32. Other commonly used reversible operations include:
61 o Adding or xoring a constant
62 o Bitwise rotation or other bitwise permutations
63 o bit j = (bit j) xor (bit k) for unequal constants j and k
64 o "Shift mix": S = S xor (S >> k), where k is, say, 17
65 o Replacing a fixed-length bit string with its cyclic redundancy
66 checksum, perhaps via _mm_crc32_u32(f, <some constant>) [Pike]
67
68Each of the above is a "bad" hash function that inputs and outputs
69the same number of bits. One can simply compose two or more of those
70bad hash functions to construct a higher-quality hash function.
71
72One common quality goal for integer hashing (and string hashing) is
73that flipping the 19th bit, or any other small change, applied to
74multiple input keys, causes a seemingly unpredictable difference each
75time. Similarly, any change to an input should lead to a seemingly
76unpredictable selection of the output bits to flip.
77
78Therefore, if we want a high-quality hash function that inputs c bits
79and outputs fewer than c bits, we can simply truncate the output of a
80high-quality hash function that inputs and outputs c bits.
81
82To give a concrete example, here is Bob Jenkins' mix(), published in
831996 [Jenkins]. It's input is 96 bits in three 32-bit variables, and its output
84is 96 bits. However, one may use a subset of the output bits, as every
85output bit is affected by every non-empty subset of the input bits.
86
87 Input: a, b, and c
88 Algorithm:
89 a -= b; a -= c; a ^= (c>>13);
90 b -= c; b -= a; b ^= (a<<8);
91 c -= a; c -= b; c ^= (b>>13);
92 a -= b; a -= c; a ^= (c>>12);
93 b -= c; b -= a; b ^= (a<<16);
94 c -= a; c -= b; c ^= (b>>5);
95 a -= b; a -= c; a ^= (c>>3);
96 b -= c; b -= a; b ^= (a<<10);
97 c -= a; c -= b; c ^= (b>>15);
98 Output: a, b, and c
99
1002.2 VARIATIONS ON STRING HASHING
101
102There are three variations on our initial sketch worth noting.
103
104First, for speed, one can special-case short inputs, as the CityHash
105and FarmHash algorithms do. The number of special cases can be
106reduced by using loads that may overlap: for example, a hash of a 9-
107to 16-byte string can be implemented by a hash that inputs two 8-byte
108values (the first 8 and last 8 bytes of the input string) and the string
109length [CityHash, FarmHash].
110
111Second, one may choose different means of incorporating input bits
112into the internal state. One example: the mixing of S and input bits
113may be interleaved with the mixing of parts of S and other parts of S.
114Another example: the input bits processed in a loop iteration might be
115xor'ed into multiple places in S, rather than just one, or might be
116hashed with each other before touching S [Murmur]. The advantages and
117disadvantages of these are unclear.
118
119Third, one may repeatedly "squeeze information" from S, by remixing it with
120itself and then revealing a subset of S. This is convenient when one would
121like a family of hash functions with different output lengths. A special
122case of the idea, called the "sponge construction," has been well studied and
123adopted by the authors of Keccak and others [SHA-3].
124
1253 HASH FUNCTIONS FOR HASH TABLES
126
127It isn't hard to find real-life examples where hash tables or the hash
128functions for them take more than 5% of a program's CPU time.
129Improvements to hash tables and their hash functions are therefore a
130classic example of software performance tuning. Unfortunately, the
131best choice may be platform-dependent, so to avoid writing your own
132collection of #ifdefs, please consider selecting something like the
133FarmHash family of hash functions, that supply decent
134platform-dependent logic for you.
135
136To tune a program, often one will replace an existing hash function with a
137faster, lower-quality hash function, despite the increased chance of unlucky
138or pathological performance problems. Clever algorithms can mitigate this
139risk. For example, hash tables can start with one hash function and then
140switch to another if things seem to be going poorly. Therefore, one should
141rarely plan to spend much CPU time on a secure hash function (such as SHA-3)
142or a near-universal hash function (such as VHASH) when seeking the best
143possible performance from a hash table. Against that, those types of hash
144functions can limit the risk of pathological performance problems when one is
145designing around typical hash-based algorithms that stick with a single hash
146function no matter how it behaves on the data at hand.
147
1484
149
150REFERENCES
151
152[Murmur] Appleby, Austin. https://code.google.com/p/smhasher,
153 https://sites.google.com/site/murmurhash/
154[SMHasher] Appleby, Austin. https://code.google.com/p/smhasher
155[SHA-3] Bertoni, Guido, et al. http://keccak.noekeon.org/
156[Jenkins] Jenkins, Bob. http://burtleburtle.net/bob/hash/doobs.html
157[VHASH] Krovetz, Ted. Message authentication on 64-bit architectures. In
158 Selected Areas of Cryptography – SAC 2006. Springer-Verlag, 2006.
159[CityHash] Pike, Geoff and Alakuijala, Jyrki. https://code.google.com/p/cityhash
160[FarmHash] Pike, Geoff. https://code.google.com/p/farmhash
161[Pike] Pike, Geoff. http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf