gpike | a371645 | 2014-03-31 22:10:41 +0000 | [diff] [blame^] | 1 | UNDERSTANDING HASH FUNCTIONS |
| 2 | by Geoff Pike |
| 3 | |
| 4 | Version 0.1 --- early draft --- comments and questions welcome! |
| 5 | References appear in square brackets. |
| 6 | |
| 7 | 1 INTRODUCTION |
| 8 | |
| 9 | Hashing has proven tremendously useful in constructing various fast |
| 10 | data structures and algorithms. It is typically possible to simplify |
| 11 | the analysis of hash-based algorithms if one assumes that the relevant |
| 12 | hash functions are high quality. At the other extreme, if the |
| 13 | relevant hash functions were always to return the same value, many |
| 14 | hash-based algorithms become algorithms that are slower, simpler, but still well-known. |
| 15 | For example, a chaining hash table devolves into a linked list. |
| 16 | |
| 17 | There are many possible definitions of hash function quality. For |
| 18 | example, one might want a list of keys and their hashes to provide no |
| 19 | pattern that would allow an opponent to predict anything about the |
| 20 | hashes of other keys. Although I cannot prove it, I think I can meet |
| 21 | this and many other definitions of quality with |
| 22 | |
| 23 | f(s) = SHA-3(concatenation of z and s), |
| 24 | |
| 25 | where z is some secret string known only to me. This well-known trick |
| 26 | provides, I think, more high-quality hash functions than anyone will |
| 27 | need, though greater computational power in the future may push us to |
| 28 | replace SHA-3 from time to time. |
| 29 | |
| 30 | In short, discussions about choosing a hash function are almost always |
| 31 | discussions about speed, energy consumption, or similar. Concerns |
| 32 | about hash quality are easy to fix, for a price. |
| 33 | |
| 34 | 2 ANATOMY OF A HASH FUNCTION |
| 35 | |
| 36 | Hash functions that input strings of arbitrary length are written in |
| 37 | terms of an internal state, S. In many cases the internal state is a |
| 38 | fixed number of bits and will fit in machine registers. One generic |
| 39 | sketch of a string hash is: |
| 40 | |
| 41 | let S = some initial value |
| 42 | let c = the length of S in bits |
| 43 | while (input is not exhausted) { |
| 44 | let t = the next c bits of input (padded with zeroes if less than c remain) |
| 45 | S = M(S xor t) |
| 46 | } |
| 47 | return F(S) |
| 48 | |
| 49 | where M is a hash function that inputs and outputs c bits, and F is a |
| 50 | hash function that inputs c bits and outputs however many bits one |
| 51 | needs to return. In some sense we have reduced the string-hashing |
| 52 | problem to two integer hashing problems. |
| 53 | |
| 54 | 2.1 INTEGER HASHING TECHNIQUES |
| 55 | |
| 56 | A hash function that inputs and outputs the same number of bits, say, |
| 57 | 32, can use reversible bit-twiddling operations, each of which is |
| 58 | "onto" in the mathematical sense. For example, multiplication by an |
| 59 | odd constant is reversible, as all odd numbers are relatively prime to |
| 60 | 2^32. Other commonly used reversible operations include: |
| 61 | o Adding or xoring a constant |
| 62 | o Bitwise rotation or other bitwise permutations |
| 63 | o bit j = (bit j) xor (bit k) for unequal constants j and k |
| 64 | o "Shift mix": S = S xor (S >> k), where k is, say, 17 |
| 65 | o Replacing a fixed-length bit string with its cyclic redundancy |
| 66 | checksum, perhaps via _mm_crc32_u32(f, <some constant>) [Pike] |
| 67 | |
| 68 | Each of the above is a "bad" hash function that inputs and outputs |
| 69 | the same number of bits. One can simply compose two or more of those |
| 70 | bad hash functions to construct a higher-quality hash function. |
| 71 | |
| 72 | One common quality goal for integer hashing (and string hashing) is |
| 73 | that flipping the 19th bit, or any other small change, applied to |
| 74 | multiple input keys, causes a seemingly unpredictable difference each |
| 75 | time. Similarly, any change to an input should lead to a seemingly |
| 76 | unpredictable selection of the output bits to flip. |
| 77 | |
| 78 | Therefore, if we want a high-quality hash function that inputs c bits |
| 79 | and outputs fewer than c bits, we can simply truncate the output of a |
| 80 | high-quality hash function that inputs and outputs c bits. |
| 81 | |
| 82 | To give a concrete example, here is Bob Jenkins' mix(), published in |
| 83 | 1996 [Jenkins]. It's input is 96 bits in three 32-bit variables, and its output |
| 84 | is 96 bits. However, one may use a subset of the output bits, as every |
| 85 | output bit is affected by every non-empty subset of the input bits. |
| 86 | |
| 87 | Input: a, b, and c |
| 88 | Algorithm: |
| 89 | a -= b; a -= c; a ^= (c>>13); |
| 90 | b -= c; b -= a; b ^= (a<<8); |
| 91 | c -= a; c -= b; c ^= (b>>13); |
| 92 | a -= b; a -= c; a ^= (c>>12); |
| 93 | b -= c; b -= a; b ^= (a<<16); |
| 94 | c -= a; c -= b; c ^= (b>>5); |
| 95 | a -= b; a -= c; a ^= (c>>3); |
| 96 | b -= c; b -= a; b ^= (a<<10); |
| 97 | c -= a; c -= b; c ^= (b>>15); |
| 98 | Output: a, b, and c |
| 99 | |
| 100 | 2.2 VARIATIONS ON STRING HASHING |
| 101 | |
| 102 | There are three variations on our initial sketch worth noting. |
| 103 | |
| 104 | First, for speed, one can special-case short inputs, as the CityHash |
| 105 | and FarmHash algorithms do. The number of special cases can be |
| 106 | reduced by using loads that may overlap: for example, a hash of a 9- |
| 107 | to 16-byte string can be implemented by a hash that inputs two 8-byte |
| 108 | values (the first 8 and last 8 bytes of the input string) and the string |
| 109 | length [CityHash, FarmHash]. |
| 110 | |
| 111 | Second, one may choose different means of incorporating input bits |
| 112 | into the internal state. One example: the mixing of S and input bits |
| 113 | may be interleaved with the mixing of parts of S and other parts of S. |
| 114 | Another example: the input bits processed in a loop iteration might be |
| 115 | xor'ed into multiple places in S, rather than just one, or might be |
| 116 | hashed with each other before touching S [Murmur]. The advantages and |
| 117 | disadvantages of these are unclear. |
| 118 | |
| 119 | Third, one may repeatedly "squeeze information" from S, by remixing it with |
| 120 | itself and then revealing a subset of S. This is convenient when one would |
| 121 | like a family of hash functions with different output lengths. A special |
| 122 | case of the idea, called the "sponge construction," has been well studied and |
| 123 | adopted by the authors of Keccak and others [SHA-3]. |
| 124 | |
| 125 | 3 HASH FUNCTIONS FOR HASH TABLES |
| 126 | |
| 127 | It isn't hard to find real-life examples where hash tables or the hash |
| 128 | functions for them take more than 5% of a program's CPU time. |
| 129 | Improvements to hash tables and their hash functions are therefore a |
| 130 | classic example of software performance tuning. Unfortunately, the |
| 131 | best choice may be platform-dependent, so to avoid writing your own |
| 132 | collection of #ifdefs, please consider selecting something like the |
| 133 | FarmHash family of hash functions, that supply decent |
| 134 | platform-dependent logic for you. |
| 135 | |
| 136 | To tune a program, often one will replace an existing hash function with a |
| 137 | faster, lower-quality hash function, despite the increased chance of unlucky |
| 138 | or pathological performance problems. Clever algorithms can mitigate this |
| 139 | risk. For example, hash tables can start with one hash function and then |
| 140 | switch to another if things seem to be going poorly. Therefore, one should |
| 141 | rarely plan to spend much CPU time on a secure hash function (such as SHA-3) |
| 142 | or a near-universal hash function (such as VHASH) when seeking the best |
| 143 | possible performance from a hash table. Against that, those types of hash |
| 144 | functions can limit the risk of pathological performance problems when one is |
| 145 | designing around typical hash-based algorithms that stick with a single hash |
| 146 | function no matter how it behaves on the data at hand. |
| 147 | |
| 148 | 4 |
| 149 | |
| 150 | REFERENCES |
| 151 | |
| 152 | [Murmur] Appleby, Austin. https://code.google.com/p/smhasher, |
| 153 | https://sites.google.com/site/murmurhash/ |
| 154 | [SMHasher] Appleby, Austin. https://code.google.com/p/smhasher |
| 155 | [SHA-3] Bertoni, Guido, et al. http://keccak.noekeon.org/ |
| 156 | [Jenkins] Jenkins, Bob. http://burtleburtle.net/bob/hash/doobs.html |
| 157 | [VHASH] Krovetz, Ted. Message authentication on 64-bit architectures. In |
| 158 | Selected Areas of Cryptography – SAC 2006. Springer-Verlag, 2006. |
| 159 | [CityHash] Pike, Geoff and Alakuijala, Jyrki. https://code.google.com/p/cityhash |
| 160 | [FarmHash] Pike, Geoff. https://code.google.com/p/farmhash |
| 161 | [Pike] Pike, Geoff. http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf |