blob: 8d203c456cef8979f1e16e3a5da3f0f2bfc7d9b2 [file] [log] [blame]
gpikea3716452014-03-31 22:10:41 +00001UNDERSTANDING HASH FUNCTIONS
2by Geoff Pike
3
gpike34c13dd2015-03-01 20:04:23 +00004Version 0.2 --- early draft --- comments and questions welcome!
gpikea3716452014-03-31 22:10:41 +00005References appear in square brackets.
6
71 INTRODUCTION
8
9Hashing has proven tremendously useful in constructing various fast
10data structures and algorithms. It is typically possible to simplify
11the analysis of hash-based algorithms if one assumes that the relevant
12hash functions are high quality. At the other extreme, if the
13relevant hash functions were always to return the same value, many
14hash-based algorithms become algorithms that are slower, simpler, but still well-known.
15For example, a chaining hash table devolves into a linked list.
16
17There are many possible definitions of hash function quality. For
18example, one might want a list of keys and their hashes to provide no
19pattern that would allow an opponent to predict anything about the
20hashes of other keys. Although I cannot prove it, I think I can meet
21this and many other definitions of quality with
22
23 f(s) = SHA-3(concatenation of z and s),
24
25where z is some secret string known only to me. This well-known trick
26provides, I think, more high-quality hash functions than anyone will
27need, though greater computational power in the future may push us to
28replace SHA-3 from time to time.
29
30In short, discussions about choosing a hash function are almost always
31discussions about speed, energy consumption, or similar. Concerns
32about hash quality are easy to fix, for a price.
33
342 ANATOMY OF A HASH FUNCTION
35
36Hash functions that input strings of arbitrary length are written in
37terms of an internal state, S. In many cases the internal state is a
38fixed number of bits and will fit in machine registers. One generic
39sketch of a string hash is:
40
41 let S = some initial value
42 let c = the length of S in bits
43 while (input is not exhausted) {
44 let t = the next c bits of input (padded with zeroes if less than c remain)
45 S = M(S xor t)
46 }
gpike34c13dd2015-03-01 20:04:23 +000047 let n = the number of bytes hashed
48 return F(S, n)
gpikea3716452014-03-31 22:10:41 +000049
50where M is a hash function that inputs and outputs c bits, and F is a
gpike34c13dd2015-03-01 20:04:23 +000051hash function that inputs c bits (plus, say, 64 for its second argument)
52and outputs however many bits one needs to return. In some sense we have
53reduced the string-hashing problem to two integer hashing problems.
gpikea3716452014-03-31 22:10:41 +000054
552.1 INTEGER HASHING TECHNIQUES
56
57A hash function that inputs and outputs the same number of bits, say,
5832, can use reversible bit-twiddling operations, each of which is
59"onto" in the mathematical sense. For example, multiplication by an
60odd constant is reversible, as all odd numbers are relatively prime to
612^32. Other commonly used reversible operations include:
62 o Adding or xoring a constant
63 o Bitwise rotation or other bitwise permutations
64 o bit j = (bit j) xor (bit k) for unequal constants j and k
65 o "Shift mix": S = S xor (S >> k), where k is, say, 17
66 o Replacing a fixed-length bit string with its cyclic redundancy
67 checksum, perhaps via _mm_crc32_u32(f, <some constant>) [Pike]
68
69Each of the above is a "bad" hash function that inputs and outputs
70the same number of bits. One can simply compose two or more of those
71bad hash functions to construct a higher-quality hash function.
72
73One common quality goal for integer hashing (and string hashing) is
74that flipping the 19th bit, or any other small change, applied to
75multiple input keys, causes a seemingly unpredictable difference each
76time. Similarly, any change to an input should lead to a seemingly
77unpredictable selection of the output bits to flip.
78
79Therefore, if we want a high-quality hash function that inputs c bits
80and outputs fewer than c bits, we can simply truncate the output of a
81high-quality hash function that inputs and outputs c bits.
82
83To give a concrete example, here is Bob Jenkins' mix(), published in
gpike34c13dd2015-03-01 20:04:23 +0000841996 [Jenkins]. Its input is 96 bits in three 32-bit variables, and its output
gpikea3716452014-03-31 22:10:41 +000085is 96 bits. However, one may use a subset of the output bits, as every
86output bit is affected by every non-empty subset of the input bits.
87
88 Input: a, b, and c
89 Algorithm:
90 a -= b; a -= c; a ^= (c>>13);
91 b -= c; b -= a; b ^= (a<<8);
92 c -= a; c -= b; c ^= (b>>13);
93 a -= b; a -= c; a ^= (c>>12);
94 b -= c; b -= a; b ^= (a<<16);
95 c -= a; c -= b; c ^= (b>>5);
96 a -= b; a -= c; a ^= (c>>3);
97 b -= c; b -= a; b ^= (a<<10);
98 c -= a; c -= b; c ^= (b>>15);
99 Output: a, b, and c
100
1012.2 VARIATIONS ON STRING HASHING
102
103There are three variations on our initial sketch worth noting.
104
105First, for speed, one can special-case short inputs, as the CityHash
106and FarmHash algorithms do. The number of special cases can be
107reduced by using loads that may overlap: for example, a hash of a 9-
108to 16-byte string can be implemented by a hash that inputs two 8-byte
109values (the first 8 and last 8 bytes of the input string) and the string
110length [CityHash, FarmHash].
111
112Second, one may choose different means of incorporating input bits
113into the internal state. One example: the mixing of S and input bits
114may be interleaved with the mixing of parts of S and other parts of S.
115Another example: the input bits processed in a loop iteration might be
116xor'ed into multiple places in S, rather than just one, or might be
117hashed with each other before touching S [Murmur]. The advantages and
118disadvantages of these are unclear.
119
120Third, one may repeatedly "squeeze information" from S, by remixing it with
121itself and then revealing a subset of S. This is convenient when one would
122like a family of hash functions with different output lengths. A special
123case of the idea, called the "sponge construction," has been well studied and
124adopted by the authors of Keccak and others [SHA-3].
125
1263 HASH FUNCTIONS FOR HASH TABLES
127
128It isn't hard to find real-life examples where hash tables or the hash
129functions for them take more than 5% of a program's CPU time.
130Improvements to hash tables and their hash functions are therefore a
131classic example of software performance tuning. Unfortunately, the
132best choice may be platform-dependent, so to avoid writing your own
133collection of #ifdefs, please consider selecting something like the
134FarmHash family of hash functions, that supply decent
135platform-dependent logic for you.
136
137To tune a program, often one will replace an existing hash function with a
138faster, lower-quality hash function, despite the increased chance of unlucky
139or pathological performance problems. Clever algorithms can mitigate this
140risk. For example, hash tables can start with one hash function and then
141switch to another if things seem to be going poorly. Therefore, one should
142rarely plan to spend much CPU time on a secure hash function (such as SHA-3)
143or a near-universal hash function (such as VHASH) when seeking the best
144possible performance from a hash table. Against that, those types of hash
145functions can limit the risk of pathological performance problems when one is
146designing around typical hash-based algorithms that stick with a single hash
147function no matter how it behaves on the data at hand.
148
1494
150
151REFERENCES
152
153[Murmur] Appleby, Austin. https://code.google.com/p/smhasher,
154 https://sites.google.com/site/murmurhash/
155[SMHasher] Appleby, Austin. https://code.google.com/p/smhasher
156[SHA-3] Bertoni, Guido, et al. http://keccak.noekeon.org/
157[Jenkins] Jenkins, Bob. http://burtleburtle.net/bob/hash/doobs.html
158[VHASH] Krovetz, Ted. Message authentication on 64-bit architectures. In
159 Selected Areas of Cryptography SAC 2006. Springer-Verlag, 2006.
160[CityHash] Pike, Geoff and Alakuijala, Jyrki. https://code.google.com/p/cityhash
161[FarmHash] Pike, Geoff. https://code.google.com/p/farmhash
162[Pike] Pike, Geoff. http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf