Blame - Understanding_Hash_Functions - chromium.googlesource.com/external/github.com/google/farmhash

blob: 19091cb273d5a354324047ba6d1cc6801d776b27 [file] [log] [blame]

gpike	a371645	2014-03-31 22:10:41 +0000	[diff] [blame^]	1	UNDERSTANDING HASH FUNCTIONS
				2	by Geoff Pike
				3
				4	Version 0.1 --- early draft --- comments and questions welcome!
				5	References appear in square brackets.
				6
				7	1 INTRODUCTION
				8
				9	Hashing has proven tremendously useful in constructing various fast
				10	data structures and algorithms. It is typically possible to simplify
				11	the analysis of hash-based algorithms if one assumes that the relevant
				12	hash functions are high quality. At the other extreme, if the
				13	relevant hash functions were always to return the same value, many
				14	hash-based algorithms become algorithms that are slower, simpler, but still well-known.
				15	For example, a chaining hash table devolves into a linked list.
				16
				17	There are many possible definitions of hash function quality. For
				18	example, one might want a list of keys and their hashes to provide no
				19	pattern that would allow an opponent to predict anything about the
				20	hashes of other keys. Although I cannot prove it, I think I can meet
				21	this and many other definitions of quality with
				22
				23	f(s) = SHA-3(concatenation of z and s),
				24
				25	where z is some secret string known only to me. This well-known trick
				26	provides, I think, more high-quality hash functions than anyone will
				27	need, though greater computational power in the future may push us to
				28	replace SHA-3 from time to time.
				29
				30	In short, discussions about choosing a hash function are almost always
				31	discussions about speed, energy consumption, or similar. Concerns
				32	about hash quality are easy to fix, for a price.
				33
				34	2 ANATOMY OF A HASH FUNCTION
				35
				36	Hash functions that input strings of arbitrary length are written in
				37	terms of an internal state, S. In many cases the internal state is a
				38	fixed number of bits and will fit in machine registers. One generic
				39	sketch of a string hash is:
				40
				41	let S = some initial value
				42	let c = the length of S in bits
				43	while (input is not exhausted) {
				44	let t = the next c bits of input (padded with zeroes if less than c remain)
				45	S = M(S xor t)
				46	}
				47	return F(S)
				48
				49	where M is a hash function that inputs and outputs c bits, and F is a
				50	hash function that inputs c bits and outputs however many bits one
				51	needs to return. In some sense we have reduced the string-hashing
				52	problem to two integer hashing problems.
				53
				54	2.1 INTEGER HASHING TECHNIQUES
				55
				56	A hash function that inputs and outputs the same number of bits, say,
				57	32, can use reversible bit-twiddling operations, each of which is
				58	"onto" in the mathematical sense. For example, multiplication by an
				59	odd constant is reversible, as all odd numbers are relatively prime to
				60	2^32. Other commonly used reversible operations include:
				61	o Adding or xoring a constant
				62	o Bitwise rotation or other bitwise permutations
				63	o bit j = (bit j) xor (bit k) for unequal constants j and k
				64	o "Shift mix": S = S xor (S >> k), where k is, say, 17
				65	o Replacing a fixed-length bit string with its cyclic redundancy
				66	checksum, perhaps via _mm_crc32_u32(f, <some constant>) [Pike]
				67
				68	Each of the above is a "bad" hash function that inputs and outputs
				69	the same number of bits. One can simply compose two or more of those
				70	bad hash functions to construct a higher-quality hash function.
				71
				72	One common quality goal for integer hashing (and string hashing) is
				73	that flipping the 19th bit, or any other small change, applied to
				74	multiple input keys, causes a seemingly unpredictable difference each
				75	time. Similarly, any change to an input should lead to a seemingly
				76	unpredictable selection of the output bits to flip.
				77
				78	Therefore, if we want a high-quality hash function that inputs c bits
				79	and outputs fewer than c bits, we can simply truncate the output of a
				80	high-quality hash function that inputs and outputs c bits.
				81
				82	To give a concrete example, here is Bob Jenkins' mix(), published in
				83	1996 [Jenkins]. It's input is 96 bits in three 32-bit variables, and its output
				84	is 96 bits. However, one may use a subset of the output bits, as every
				85	output bit is affected by every non-empty subset of the input bits.
				86
				87	Input: a, b, and c
				88	Algorithm:
				89	a -= b; a -= c; a ^= (c>>13);
				90	b -= c; b -= a; b ^= (a<<8);
				91	c -= a; c -= b; c ^= (b>>13);
				92	a -= b; a -= c; a ^= (c>>12);
				93	b -= c; b -= a; b ^= (a<<16);
				94	c -= a; c -= b; c ^= (b>>5);
				95	a -= b; a -= c; a ^= (c>>3);
				96	b -= c; b -= a; b ^= (a<<10);
				97	c -= a; c -= b; c ^= (b>>15);
				98	Output: a, b, and c
				99
				100	2.2 VARIATIONS ON STRING HASHING
				101
				102	There are three variations on our initial sketch worth noting.
				103
				104	First, for speed, one can special-case short inputs, as the CityHash
				105	and FarmHash algorithms do. The number of special cases can be
				106	reduced by using loads that may overlap: for example, a hash of a 9-
				107	to 16-byte string can be implemented by a hash that inputs two 8-byte
				108	values (the first 8 and last 8 bytes of the input string) and the string
				109	length [CityHash, FarmHash].
				110
				111	Second, one may choose different means of incorporating input bits
				112	into the internal state. One example: the mixing of S and input bits
				113	may be interleaved with the mixing of parts of S and other parts of S.
				114	Another example: the input bits processed in a loop iteration might be
				115	xor'ed into multiple places in S, rather than just one, or might be
				116	hashed with each other before touching S [Murmur]. The advantages and
				117	disadvantages of these are unclear.
				118
				119	Third, one may repeatedly "squeeze information" from S, by remixing it with
				120	itself and then revealing a subset of S. This is convenient when one would
				121	like a family of hash functions with different output lengths. A special
				122	case of the idea, called the "sponge construction," has been well studied and
				123	adopted by the authors of Keccak and others [SHA-3].
				124
				125	3 HASH FUNCTIONS FOR HASH TABLES
				126
				127	It isn't hard to find real-life examples where hash tables or the hash
				128	functions for them take more than 5% of a program's CPU time.
				129	Improvements to hash tables and their hash functions are therefore a
				130	classic example of software performance tuning. Unfortunately, the
				131	best choice may be platform-dependent, so to avoid writing your own
				132	collection of #ifdefs, please consider selecting something like the
				133	FarmHash family of hash functions, that supply decent
				134	platform-dependent logic for you.
				135
				136	To tune a program, often one will replace an existing hash function with a
				137	faster, lower-quality hash function, despite the increased chance of unlucky
				138	or pathological performance problems. Clever algorithms can mitigate this
				139	risk. For example, hash tables can start with one hash function and then
				140	switch to another if things seem to be going poorly. Therefore, one should
				141	rarely plan to spend much CPU time on a secure hash function (such as SHA-3)
				142	or a near-universal hash function (such as VHASH) when seeking the best
				143	possible performance from a hash table. Against that, those types of hash
				144	functions can limit the risk of pathological performance problems when one is
				145	designing around typical hash-based algorithms that stick with a single hash
				146	function no matter how it behaves on the data at hand.
				147
				148	4
				149
				150	REFERENCES
				151
				152	[Murmur] Appleby, Austin. https://code.google.com/p/smhasher,
				153	https://sites.google.com/site/murmurhash/
				154	[SMHasher] Appleby, Austin. https://code.google.com/p/smhasher
				155	[SHA-3] Bertoni, Guido, et al. http://keccak.noekeon.org/
				156	[Jenkins] Jenkins, Bob. http://burtleburtle.net/bob/hash/doobs.html
				157	[VHASH] Krovetz, Ted. Message authentication on 64-bit architectures. In
				158	Selected Areas of Cryptography – SAC 2006. Springer-Verlag, 2006.
				159	[CityHash] Pike, Geoff and Alakuijala, Jyrki. https://code.google.com/p/cityhash
				160	[FarmHash] Pike, Geoff. https://code.google.com/p/farmhash
				161	[Pike] Pike, Geoff. http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf