gpike | a371645 | 2014-03-31 22:10:41 +0000 | [diff] [blame] | 1 | UNDERSTANDING HASH FUNCTIONS |
| 2 | by Geoff Pike |
| 3 | |
gpike | 34c13dd | 2015-03-01 20:04:23 +0000 | [diff] [blame] | 4 | Version 0.2 --- early draft --- comments and questions welcome! |
gpike | a371645 | 2014-03-31 22:10:41 +0000 | [diff] [blame] | 5 | References appear in square brackets. |
| 6 | |
| 7 | 1 INTRODUCTION |
| 8 | |
| 9 | Hashing has proven tremendously useful in constructing various fast |
| 10 | data structures and algorithms. It is typically possible to simplify |
| 11 | the analysis of hash-based algorithms if one assumes that the relevant |
| 12 | hash functions are high quality. At the other extreme, if the |
| 13 | relevant hash functions were always to return the same value, many |
| 14 | hash-based algorithms become algorithms that are slower, simpler, but still well-known. |
| 15 | For example, a chaining hash table devolves into a linked list. |
| 16 | |
| 17 | There are many possible definitions of hash function quality. For |
| 18 | example, one might want a list of keys and their hashes to provide no |
| 19 | pattern that would allow an opponent to predict anything about the |
| 20 | hashes of other keys. Although I cannot prove it, I think I can meet |
| 21 | this and many other definitions of quality with |
| 22 | |
| 23 | f(s) = SHA-3(concatenation of z and s), |
| 24 | |
| 25 | where z is some secret string known only to me. This well-known trick |
| 26 | provides, I think, more high-quality hash functions than anyone will |
| 27 | need, though greater computational power in the future may push us to |
| 28 | replace SHA-3 from time to time. |
| 29 | |
| 30 | In short, discussions about choosing a hash function are almost always |
| 31 | discussions about speed, energy consumption, or similar. Concerns |
| 32 | about hash quality are easy to fix, for a price. |
| 33 | |
| 34 | 2 ANATOMY OF A HASH FUNCTION |
| 35 | |
| 36 | Hash functions that input strings of arbitrary length are written in |
| 37 | terms of an internal state, S. In many cases the internal state is a |
| 38 | fixed number of bits and will fit in machine registers. One generic |
| 39 | sketch of a string hash is: |
| 40 | |
| 41 | let S = some initial value |
| 42 | let c = the length of S in bits |
| 43 | while (input is not exhausted) { |
| 44 | let t = the next c bits of input (padded with zeroes if less than c remain) |
| 45 | S = M(S xor t) |
| 46 | } |
gpike | 34c13dd | 2015-03-01 20:04:23 +0000 | [diff] [blame] | 47 | let n = the number of bytes hashed |
| 48 | return F(S, n) |
gpike | a371645 | 2014-03-31 22:10:41 +0000 | [diff] [blame] | 49 | |
| 50 | where M is a hash function that inputs and outputs c bits, and F is a |
gpike | 34c13dd | 2015-03-01 20:04:23 +0000 | [diff] [blame] | 51 | hash function that inputs c bits (plus, say, 64 for its second argument) |
| 52 | and outputs however many bits one needs to return. In some sense we have |
| 53 | reduced the string-hashing problem to two integer hashing problems. |
gpike | a371645 | 2014-03-31 22:10:41 +0000 | [diff] [blame] | 54 | |
| 55 | 2.1 INTEGER HASHING TECHNIQUES |
| 56 | |
| 57 | A hash function that inputs and outputs the same number of bits, say, |
| 58 | 32, can use reversible bit-twiddling operations, each of which is |
| 59 | "onto" in the mathematical sense. For example, multiplication by an |
| 60 | odd constant is reversible, as all odd numbers are relatively prime to |
| 61 | 2^32. Other commonly used reversible operations include: |
| 62 | o Adding or xoring a constant |
| 63 | o Bitwise rotation or other bitwise permutations |
| 64 | o bit j = (bit j) xor (bit k) for unequal constants j and k |
| 65 | o "Shift mix": S = S xor (S >> k), where k is, say, 17 |
| 66 | o Replacing a fixed-length bit string with its cyclic redundancy |
| 67 | checksum, perhaps via _mm_crc32_u32(f, <some constant>) [Pike] |
| 68 | |
| 69 | Each of the above is a "bad" hash function that inputs and outputs |
| 70 | the same number of bits. One can simply compose two or more of those |
| 71 | bad hash functions to construct a higher-quality hash function. |
| 72 | |
| 73 | One common quality goal for integer hashing (and string hashing) is |
| 74 | that flipping the 19th bit, or any other small change, applied to |
| 75 | multiple input keys, causes a seemingly unpredictable difference each |
| 76 | time. Similarly, any change to an input should lead to a seemingly |
| 77 | unpredictable selection of the output bits to flip. |
| 78 | |
| 79 | Therefore, if we want a high-quality hash function that inputs c bits |
| 80 | and outputs fewer than c bits, we can simply truncate the output of a |
| 81 | high-quality hash function that inputs and outputs c bits. |
| 82 | |
| 83 | To give a concrete example, here is Bob Jenkins' mix(), published in |
gpike | 34c13dd | 2015-03-01 20:04:23 +0000 | [diff] [blame] | 84 | 1996 [Jenkins]. Its input is 96 bits in three 32-bit variables, and its output |
gpike | a371645 | 2014-03-31 22:10:41 +0000 | [diff] [blame] | 85 | is 96 bits. However, one may use a subset of the output bits, as every |
| 86 | output bit is affected by every non-empty subset of the input bits. |
| 87 | |
| 88 | Input: a, b, and c |
| 89 | Algorithm: |
| 90 | a -= b; a -= c; a ^= (c>>13); |
| 91 | b -= c; b -= a; b ^= (a<<8); |
| 92 | c -= a; c -= b; c ^= (b>>13); |
| 93 | a -= b; a -= c; a ^= (c>>12); |
| 94 | b -= c; b -= a; b ^= (a<<16); |
| 95 | c -= a; c -= b; c ^= (b>>5); |
| 96 | a -= b; a -= c; a ^= (c>>3); |
| 97 | b -= c; b -= a; b ^= (a<<10); |
| 98 | c -= a; c -= b; c ^= (b>>15); |
| 99 | Output: a, b, and c |
| 100 | |
| 101 | 2.2 VARIATIONS ON STRING HASHING |
| 102 | |
| 103 | There are three variations on our initial sketch worth noting. |
| 104 | |
| 105 | First, for speed, one can special-case short inputs, as the CityHash |
| 106 | and FarmHash algorithms do. The number of special cases can be |
| 107 | reduced by using loads that may overlap: for example, a hash of a 9- |
| 108 | to 16-byte string can be implemented by a hash that inputs two 8-byte |
| 109 | values (the first 8 and last 8 bytes of the input string) and the string |
| 110 | length [CityHash, FarmHash]. |
| 111 | |
| 112 | Second, one may choose different means of incorporating input bits |
| 113 | into the internal state. One example: the mixing of S and input bits |
| 114 | may be interleaved with the mixing of parts of S and other parts of S. |
| 115 | Another example: the input bits processed in a loop iteration might be |
| 116 | xor'ed into multiple places in S, rather than just one, or might be |
| 117 | hashed with each other before touching S [Murmur]. The advantages and |
| 118 | disadvantages of these are unclear. |
| 119 | |
| 120 | Third, one may repeatedly "squeeze information" from S, by remixing it with |
| 121 | itself and then revealing a subset of S. This is convenient when one would |
| 122 | like a family of hash functions with different output lengths. A special |
| 123 | case of the idea, called the "sponge construction," has been well studied and |
| 124 | adopted by the authors of Keccak and others [SHA-3]. |
| 125 | |
| 126 | 3 HASH FUNCTIONS FOR HASH TABLES |
| 127 | |
| 128 | It isn't hard to find real-life examples where hash tables or the hash |
| 129 | functions for them take more than 5% of a program's CPU time. |
| 130 | Improvements to hash tables and their hash functions are therefore a |
| 131 | classic example of software performance tuning. Unfortunately, the |
| 132 | best choice may be platform-dependent, so to avoid writing your own |
| 133 | collection of #ifdefs, please consider selecting something like the |
| 134 | FarmHash family of hash functions, that supply decent |
| 135 | platform-dependent logic for you. |
| 136 | |
| 137 | To tune a program, often one will replace an existing hash function with a |
| 138 | faster, lower-quality hash function, despite the increased chance of unlucky |
| 139 | or pathological performance problems. Clever algorithms can mitigate this |
| 140 | risk. For example, hash tables can start with one hash function and then |
| 141 | switch to another if things seem to be going poorly. Therefore, one should |
| 142 | rarely plan to spend much CPU time on a secure hash function (such as SHA-3) |
| 143 | or a near-universal hash function (such as VHASH) when seeking the best |
| 144 | possible performance from a hash table. Against that, those types of hash |
| 145 | functions can limit the risk of pathological performance problems when one is |
| 146 | designing around typical hash-based algorithms that stick with a single hash |
| 147 | function no matter how it behaves on the data at hand. |
| 148 | |
| 149 | 4 |
| 150 | |
| 151 | REFERENCES |
| 152 | |
| 153 | [Murmur] Appleby, Austin. https://code.google.com/p/smhasher, |
| 154 | https://sites.google.com/site/murmurhash/ |
| 155 | [SMHasher] Appleby, Austin. https://code.google.com/p/smhasher |
| 156 | [SHA-3] Bertoni, Guido, et al. http://keccak.noekeon.org/ |
| 157 | [Jenkins] Jenkins, Bob. http://burtleburtle.net/bob/hash/doobs.html |
| 158 | [VHASH] Krovetz, Ted. Message authentication on 64-bit architectures. In |
| 159 | Selected Areas of Cryptography – SAC 2006. Springer-Verlag, 2006. |
| 160 | [CityHash] Pike, Geoff and Alakuijala, Jyrki. https://code.google.com/p/cityhash |
| 161 | [FarmHash] Pike, Geoff. https://code.google.com/p/farmhash |
| 162 | [Pike] Pike, Geoff. http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf |