Blame - Understanding_Hash_Functions - chromium.googlesource.com/external/github.com/google/farmhash

blob: 8d203c456cef8979f1e16e3a5da3f0f2bfc7d9b2 [file] [log] [blame]

gpike	a371645	2014-03-31 22:10:41 +0000	[diff] [blame]	1	UNDERSTANDING HASH FUNCTIONS
				2	by Geoff Pike
				3
gpike	34c13dd	2015-03-01 20:04:23 +0000	[diff] [blame]	4	Version 0.2 --- early draft --- comments and questions welcome!
gpike	a371645	2014-03-31 22:10:41 +0000	[diff] [blame]	5	References appear in square brackets.
				6
				7	1 INTRODUCTION
				8
				9	Hashing has proven tremendously useful in constructing various fast
				10	data structures and algorithms. It is typically possible to simplify
				11	the analysis of hash-based algorithms if one assumes that the relevant
				12	hash functions are high quality. At the other extreme, if the
				13	relevant hash functions were always to return the same value, many
				14	hash-based algorithms become algorithms that are slower, simpler, but still well-known.
				15	For example, a chaining hash table devolves into a linked list.
				16
				17	There are many possible definitions of hash function quality. For
				18	example, one might want a list of keys and their hashes to provide no
				19	pattern that would allow an opponent to predict anything about the
				20	hashes of other keys. Although I cannot prove it, I think I can meet
				21	this and many other definitions of quality with
				22
				23	f(s) = SHA-3(concatenation of z and s),
				24
				25	where z is some secret string known only to me. This well-known trick
				26	provides, I think, more high-quality hash functions than anyone will
				27	need, though greater computational power in the future may push us to
				28	replace SHA-3 from time to time.
				29
				30	In short, discussions about choosing a hash function are almost always
				31	discussions about speed, energy consumption, or similar. Concerns
				32	about hash quality are easy to fix, for a price.
				33
				34	2 ANATOMY OF A HASH FUNCTION
				35
				36	Hash functions that input strings of arbitrary length are written in
				37	terms of an internal state, S. In many cases the internal state is a
				38	fixed number of bits and will fit in machine registers. One generic
				39	sketch of a string hash is:
				40
				41	let S = some initial value
				42	let c = the length of S in bits
				43	while (input is not exhausted) {
				44	let t = the next c bits of input (padded with zeroes if less than c remain)
				45	S = M(S xor t)
				46	}
gpike	34c13dd	2015-03-01 20:04:23 +0000	[diff] [blame]	47	let n = the number of bytes hashed
				48	return F(S, n)
gpike	a371645	2014-03-31 22:10:41 +0000	[diff] [blame]	49
				50	where M is a hash function that inputs and outputs c bits, and F is a
gpike	34c13dd	2015-03-01 20:04:23 +0000	[diff] [blame]	51	hash function that inputs c bits (plus, say, 64 for its second argument)
				52	and outputs however many bits one needs to return. In some sense we have
				53	reduced the string-hashing problem to two integer hashing problems.
gpike	a371645	2014-03-31 22:10:41 +0000	[diff] [blame]	54
				55	2.1 INTEGER HASHING TECHNIQUES
				56
				57	A hash function that inputs and outputs the same number of bits, say,
				58	32, can use reversible bit-twiddling operations, each of which is
				59	"onto" in the mathematical sense. For example, multiplication by an
				60	odd constant is reversible, as all odd numbers are relatively prime to
				61	2^32. Other commonly used reversible operations include:
				62	o Adding or xoring a constant
				63	o Bitwise rotation or other bitwise permutations
				64	o bit j = (bit j) xor (bit k) for unequal constants j and k
				65	o "Shift mix": S = S xor (S >> k), where k is, say, 17
				66	o Replacing a fixed-length bit string with its cyclic redundancy
				67	checksum, perhaps via _mm_crc32_u32(f, <some constant>) [Pike]
				68
				69	Each of the above is a "bad" hash function that inputs and outputs
				70	the same number of bits. One can simply compose two or more of those
				71	bad hash functions to construct a higher-quality hash function.
				72
				73	One common quality goal for integer hashing (and string hashing) is
				74	that flipping the 19th bit, or any other small change, applied to
				75	multiple input keys, causes a seemingly unpredictable difference each
				76	time. Similarly, any change to an input should lead to a seemingly
				77	unpredictable selection of the output bits to flip.
				78
				79	Therefore, if we want a high-quality hash function that inputs c bits
				80	and outputs fewer than c bits, we can simply truncate the output of a
				81	high-quality hash function that inputs and outputs c bits.
				82
				83	To give a concrete example, here is Bob Jenkins' mix(), published in
gpike	34c13dd	2015-03-01 20:04:23 +0000	[diff] [blame]	84	1996 [Jenkins]. Its input is 96 bits in three 32-bit variables, and its output
gpike	a371645	2014-03-31 22:10:41 +0000	[diff] [blame]	85	is 96 bits. However, one may use a subset of the output bits, as every
				86	output bit is affected by every non-empty subset of the input bits.
				87
				88	Input: a, b, and c
				89	Algorithm:
				90	a -= b; a -= c; a ^= (c>>13);
				91	b -= c; b -= a; b ^= (a<<8);
				92	c -= a; c -= b; c ^= (b>>13);
				93	a -= b; a -= c; a ^= (c>>12);
				94	b -= c; b -= a; b ^= (a<<16);
				95	c -= a; c -= b; c ^= (b>>5);
				96	a -= b; a -= c; a ^= (c>>3);
				97	b -= c; b -= a; b ^= (a<<10);
				98	c -= a; c -= b; c ^= (b>>15);
				99	Output: a, b, and c
				100
				101	2.2 VARIATIONS ON STRING HASHING
				102
				103	There are three variations on our initial sketch worth noting.
				104
				105	First, for speed, one can special-case short inputs, as the CityHash
				106	and FarmHash algorithms do. The number of special cases can be
				107	reduced by using loads that may overlap: for example, a hash of a 9-
				108	to 16-byte string can be implemented by a hash that inputs two 8-byte
				109	values (the first 8 and last 8 bytes of the input string) and the string
				110	length [CityHash, FarmHash].
				111
				112	Second, one may choose different means of incorporating input bits
				113	into the internal state. One example: the mixing of S and input bits
				114	may be interleaved with the mixing of parts of S and other parts of S.
				115	Another example: the input bits processed in a loop iteration might be
				116	xor'ed into multiple places in S, rather than just one, or might be
				117	hashed with each other before touching S [Murmur]. The advantages and
				118	disadvantages of these are unclear.
				119
				120	Third, one may repeatedly "squeeze information" from S, by remixing it with
				121	itself and then revealing a subset of S. This is convenient when one would
				122	like a family of hash functions with different output lengths. A special
				123	case of the idea, called the "sponge construction," has been well studied and
				124	adopted by the authors of Keccak and others [SHA-3].
				125
				126	3 HASH FUNCTIONS FOR HASH TABLES
				127
				128	It isn't hard to find real-life examples where hash tables or the hash
				129	functions for them take more than 5% of a program's CPU time.
				130	Improvements to hash tables and their hash functions are therefore a
				131	classic example of software performance tuning. Unfortunately, the
				132	best choice may be platform-dependent, so to avoid writing your own
				133	collection of #ifdefs, please consider selecting something like the
				134	FarmHash family of hash functions, that supply decent
				135	platform-dependent logic for you.
				136
				137	To tune a program, often one will replace an existing hash function with a
				138	faster, lower-quality hash function, despite the increased chance of unlucky
				139	or pathological performance problems. Clever algorithms can mitigate this
				140	risk. For example, hash tables can start with one hash function and then
				141	switch to another if things seem to be going poorly. Therefore, one should
				142	rarely plan to spend much CPU time on a secure hash function (such as SHA-3)
				143	or a near-universal hash function (such as VHASH) when seeking the best
				144	possible performance from a hash table. Against that, those types of hash
				145	functions can limit the risk of pathological performance problems when one is
				146	designing around typical hash-based algorithms that stick with a single hash
				147	function no matter how it behaves on the data at hand.
				148
				149	4
				150
				151	REFERENCES
				152
				153	[Murmur] Appleby, Austin. https://code.google.com/p/smhasher,
				154	https://sites.google.com/site/murmurhash/
				155	[SMHasher] Appleby, Austin. https://code.google.com/p/smhasher
				156	[SHA-3] Bertoni, Guido, et al. http://keccak.noekeon.org/
				157	[Jenkins] Jenkins, Bob. http://burtleburtle.net/bob/hash/doobs.html
				158	[VHASH] Krovetz, Ted. Message authentication on 64-bit architectures. In
				159	Selected Areas of Cryptography – SAC 2006. Springer-Verlag, 2006.
				160	[CityHash] Pike, Geoff and Alakuijala, Jyrki. https://code.google.com/p/cityhash
				161	[FarmHash] Pike, Geoff. https://code.google.com/p/farmhash
				162	[Pike] Pike, Geoff. http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf