Entropy-Learned Hashing Constant Time Hashing with Controllable Uniformity


B. Hentschel, U. Sirin, and S. Idreos, “Entropy-Learned Hashing Constant Time Hashing with Controllable Uniformity,” in ACM SIGMOD International Conference on Management of Data, 2022.
entropylearnedhashing.pdf1.14 MB


Hashing is a widely used technique for creating uniformly random numbers from arbitrary input data. It is a core component in relational data systems, key-value stores, compilers, networks and many more areas used for a wide range of operations including indexing, partitioning, filters, and sketches. Due to both the computational and data heavy nature of hashing in such operations, numerous recent studies observe that hashing emerges as a core bottleneck in modern systems. For example, a typical complex database query (TPC-H) could spend 50% of its total cost in hash tables, while Google spends at least 2% of its total computational cost across all systems on C++ hash tables, resulting in a massive yearly footprint coming from just a single operation.

In this paper we propose a new method, called Entropy-Learned Hashing, which reduces the computational cost of hashing by up to an order of magnitude. The key question we ask is “how much randomness is needed?”: We look at hashing from a pseudorandom point of view, wherein hashing is viewed as extracting randomness from a data source to create random outputs and we show that state-of-the-art hash functions do too much work. Entropy-Learned Hashing 1) models and estimates the randomness (entropy) of the input data, and then 2) creates data-specific hash functions that use only the parts of the data that are needed to differentiate the outputs. Thus the resulting hash functions can minimize the amount of computation needed while we prove that they act similarly to traditional hash functions in terms of the uniformity of their outputs. We test Entropy-Learned Hashing across diverse and core hashing operations such as hash tables, Bloom filters, and partitioning and we observe an increase in throughput in the order of 3.7X, 4.0X, and 14X respectively compared to the best in-class hash functions and implementations used at scale by Google and Meta.