CS 100.9.1: Hash Tables and Hashing

Posted by AdamT213 on May 16, 2018

Ten posts in, we have reached the point where, following the pattern, this post should be called “CS101.0”! However, since this series is not nearly in-depth enough, nor am I nearly knowledgeable enough, to write a CS 101 course, we will call it CS 100.9.1. Don’t think that this means the content herein is, in any way, an order of magnitude less important than what we have covered thus far. In fact, hash tables are extremely foundational data structures, and probably should have been covered earlier in this series. See, I don’t know ANYTHING?! (Also, notice how I opted to go for 100.9.1 instead of 100.91, mirroring the way software versions are released. Can anyone say META?!)

Okay, on to hash tables. You are likely familar with whatever implementation of hash tables your language of choice implements: dictionaries in Python, hashes in Ruby, objects in JavaScript, etc. Let’s take the Python naming convention as an example, since it is the most indicative. A hash table is just that: a dictionary. When you think about what a dictionary really is, it is a list of key-value pairs, where looking up a key maps to a particular value. This can be particularly useful in solving problems where multiple pieces of information need to be known at every step, perhaps even some problems you would have thought to use an array (or several arrays) for. When used properly, hash tables can greatly reduce the time complexity of various functions, and can make code easier to understand.

First, an overview of hashing. I mentioned that each key is “mapped” to a value. Of course, this does not happen automatically, but instead, the computer accomplishes this by way of a hash function. Understanding how to write a good hash function is beyond the scope of what most interviewers will expect of a junior developer, but it is not a bad thing to have some familiarity with, even if you can’t write perfect code for it.

The problem with adding keys to a hash table is that keys can be anything. In arrays, it is simple to add items and check whether items are there; they are stored at an integer index, and we can check for an item at that index. Hash keys are not necessarily integers, so we need a way to standardize them and map them to a consistent set of possible indices. This is where hashing comes in. A hash function takes a particular key and converts it to a representation that can be stored in the table. Hash functions must always map the same key to the same location, and they must ensure that all equal keys map to the same location (in other words, 9.0, the float, and 9, the integer, must map to the same location in memory). The problem here is that, when creating the hash table, we do not always know what keys and kinds of keys we will need to store. What is more, hash tables are not memory-efficient data structures, and can quickly grow unwieldy if we allot to much space for keys to be mapped to. The struggle with hash tables is finding a balance between allotting too much space and making sure that unique keys map to unique locations. When two different keys map to the same location, this is known as a “collision”, and it must be handled by the hash function. This, then, is the fundamental struggle of hash functions: map keys to as many different locations as possible without taking up excessive memory, minimize collisions, and when they do occur, handle them as efficiently as possible.

Going back to our Python dictionary example, Python’s dictionary hash function works by taking the 32-bit binary representation of each key, mapping to the last three bits in that key, and if there is a collision at that location, introducing earlier bits until an emtpy slot is found. When 2/3 of the slots in the table are occupied, the table resizes (4x for a table <50,000 entries long, 2x for a table >50,000). This, however, is not the most common way to handle collisions. Far more typical is using “chaining”, wherein, a key is mapped to a location, and if there is a collision there, a linked list is created at that location, with each item in the list pointing to the next key that maps to that location. In any case, the bottom line is that hash tables are efficient precisely because we can write hash functions that optimize space efficiency while minimizing the number of collisions, creating a situation where, in the average case, an entry in the table is no more than 1-2 collisions deep, and the space required to make this happen is not substantially more than the number of items in the list. In fact, when used properly, a hash table with a good hash function can find a given key and its value, insert a key, and delete a key, all in O(1) time in the average case.

It is important to realize that when hashing with chaining, in the worst case, and indeed, this is true for all hasing schemes: the lookup, insertion, and deletion times will be O(n). This is because, at worst, all items will hash to the same location, and we will be left with an n-length list of items. In practice, we can make this incredibly unlikely, and the way to do so is one of the most common and thus far undiscussed algorithm strategies in CS: randomization. The idea is that, with n keys to be hashed, each key is mapped to a random location, and thus each key is equally likely to be mapped to any location. In practice, this assumption does not hold up for various reasons, but don’t worry about that, and take it to be true. If this is true, then supposing that we have n keys and a table of length m, the expected length of each chain is n/m, since for each slot in the table, each key has a 1/m chance of hashing to it, and there are n keys. The expected chain length is referred to as the “load factor” of the table. If there are twice as many keys as slots in the table, the load factor will be two. If n is 10 times m, the load factor will be 10, and so on. This is not important. What is important is that these are constants, so, the cost of performing an action on the table will always be O(1+n/m), and thus will be constant.

The tricky part, then, is writing a hash function that, as closely as possible, simulates the assumption we made that each key is equally likely to map to any given location, independent of where other keys map. Again, this assumption, known as simple uniform hashing, is a theoretical ideal that we can never achieve in practice, but we can approximate it as closely as possible, and this is where randomization comes into play. We call this algorithm universal hashing, and it looks like this:

h(k) = [(ak + b) mod p] mod m

Obviously, h(k) is the hash function for a particular key. p is a prime number larger than the universe of possible keys you are hashing, so for example, if you are making a table of bones in the human body, p is a prime number >206. a and b are random numbers between 0 and p. In practice, this function creates a situation where the likelihood of two non-equal keys hashing to the same location is 1/m in the worst case, meaning that our theoretical ideal is essentially met.

Again, understanding universal hashing is way beyond what most interviewers will expect of you, but I figured at least having some exposure to it could only help you in your quest. Hopefully, having some sense of how hash tables work under the hood helps elucidate why they are so useful in solving problems. Indeed, they are some of the most common data structures in CS applications, and now that you know how they work, I urge you to begin thinking about when you can use them to simplify problems. In general, in situations where you want to make use of relationships between multiple data points, hash tables are often your best bet.

I will leave you today with some wise words from a presenter at PyCon 2010 by the name of Brandon Craig Rhodes

“May your hashes be unique, your hash tables never full, and may your keys rarely collide”

Thanks for reading! Til next time.