To start with, let's say we have a simple list of distinct integers (change it if you want - the page will update):
Python lists are actually arrays — contiguous chunks of memory. The name "list" may be misleading to people who know about double-linked lists but are unfamiliar with Python. You can picture a Python list as a row of slots, where each slot can hold a Python object:
To check if an element is present in a list, we can use the
in operator like this:
number in simple_list, which returns either
False. Under the hood this short snippet does a linear scan. This can be a lot of work. To see this, let's reimplement it in Python.
Let's say we're looking for the following number in the original list:
(The visualization is interactive. The buttons allow you to step through the code. Notice here that the time slider is draggable - feel free to rewind the time or move it forward. Also, feel free to mess with the input and the original list - the visualization will update automatically)
What's not so great about linear scans? If we have a million distinct numbers, in the worst case scenario, we may need to scan the whole list. But scanning over a few elements is no big deal. We need to have some order and predictability to make the search fast. We need to have some idea of where the searched element is located.
A Python dict implementation is basically a scan of a list (but a pretty weird scan). We'll build the actual algorithm and data structure inside Python dictionary step by step, starting with the code above, which is intentionally verbose.
Chapter 1: searching efficiently in a list
A Python dict is a collection of key-value pairs. And, the most important part of it is handling keys. Keys need to be organized in such a way that efficient searching, inserting and deleting is possible.
In this chapter, to keep things simple, we won't have any values, and "keys" will just be plain integers. So, the simplified problem is to check if a number is present in a list, but we have to do this fast. We'll tackle the real problem in the following chapters. But for now, bear with me.
Accessing a single element by index is very fast. Accessing only a few elements would be fast too. We don't want to be doing a linear scan over the whole list every time we look up a number, so we need to organize our data in a clever way.
Let's begin by creating a new list of slots. Each slot will either hold a number from the original list or be empty (empty slots will hold
None). We'll use the number itself to compute an index of a slot. The simplest way to do this is to take the remainder of
number divided by
number % len(the_list) and put our number in slot with this index. To check if the number is there we could compute the slot index again and see if it is empty.
Would this approach work, however? Not entirely. For example,
50 will get the same slot index (
2, and it will be overwritten. Situations like these are called collisions.
To make this approach viable, we need to somehow resolve collisions. Let's do the following: if the slot is already occupied by some other number, we'll just check the slot that comes right after it. And if that slot is empty, we'll put the number there. But, what if that slot is also occupied? Once again, we'll go ahead and check the next slot. We'll keep repeating this process until we finally hit an empty slot. This process is called probing. And because we do it linearly, it is called linear probing. In code, we would write this as
(idx + 1) % len(simple_list), so it wraps around back to the beginning at the last index:
If we make the new list the same size as the original list, we'll have too many collisions. If we make it 10x larger, we'll have very few collisions, but we'll waste a lot of memory. So what size should it be? We want to hit the sweet spot where we don't use up too much memory but also don't have too many collisions. Twice the size of the original list is reasonable.
Let's transform the original list using this method (when reading this code, keep in mind that
original_list is a list of distinct numbers, so we don't need to handle duplicates just yet).
To search for a number, we retrace all the steps necessary to insert it: we start from the slot
number % len(new_list) and do linear probing. We either end up finding the number or hitting an empty slot. The latter situation means that the number is not present.
Calculating an index based on the value of the number and resolving collisions by linear probing is an incredibly powerful idea. What we've just implemented is a simple hash table (more about the term in the next chapter). Python dict uses a hash table internally, albeit a more complex variant.
We still haven't discussed adding more elements (what happens if a table overflows?), removing elements (removing an element without a trace would cause a hole to appear, wouldn't this cause the search algorithm to stop prematurely in many cases?), and perhaps most importantly, handling objects other than integers - strings, tuples, floats. We'll do this in the next chapters.
Collision resolution via separate chaining
There is a different method of collision resolution, called separate chaining. It is also a good strategy which is commonly used. But that's not how Python resolves collision in dicts, so it is beyond the scope of this explanation.
A couple of the notes about the explanation
First, this explanation discusses
dict as it is implemented in CPython — the "default" and most common implementation of the Python language (if you are not sure what implementation you use, it is almost certainly CPython). Some other implementations are PyPy, Jython and IronPython. The way dicts works in each of these implementations may be similar to CPython (in the case of PyPy) or very different from CPython (in the case of Jython).
Second, even though dict in CPython is implemented in C, this explanation uses Python for code snippets. The goal of this page is to help you understand the algorithms and the underlying data structures, not the minutiae of the C code (these details are interesting too - they are just are beyond the scope of this explanation).