A dictionary is a data structure that stores key-value pairs. As we saw in the previous post and will keep exploring, dictionaries are behind many of Python’s features, including objects and namespaces. In this post, I take a deep dive into the underpinnings of Python’s dictionaries and describe some of their most useful features.

The inner workings of dictionaries (and sets)

Dictionaries and sets are ideal data structures for collecting items when 1. the data has no intrinsic order; and 2. elements can be retrieved using keys, which must be so-called hashable objects (described further below). Under the hood, dictionaries and sets are very similar: both are built on a data structure called the hash map. As with lists and tuples, we can visualize a hash map as a finite number of memory buckets, each of which can store a reference to an object. The number of buckets is called the capacity. Each possible key maps unambiguously to a bucket, thanks to the hash function. Hence, lookups, insertions and deletions are performed in constant time on average. The difference between dictionaries and sets lies in what goes into the bucket: dictionaries store key-value pairs, while sets only store keys.

Hashing, explained

Hash maps rely on the key objects implementing a hash method (.__hash__()), which maps each object to a fixed-size integer. When we apply the built-in hash() function to an object, its hash method gets called. The hash method needs to be fast, since every operation on a hash map is bounded by its speed. In addition, it needs to be deterministic, at least within a single interpreter run. For reasons we will go over later, a hashable object additionally needs to support the equality operator, .__eq__() (in Python 2, .__cmp__() could also be used).

Let’s see some examples using the immutable builtins:

hash(1)        # 1
hash(1.)       # 1, since 1. == 1
hash("1")      # 6333942777250828306 (string hashes vary between runs)

hash((1))      # 1: (1) is just the int 1, not a tuple
hash((1, 1))   # 8389048192121911274
hash((1., 1.)) # 8389048192121911274: equal tuples, equal hashes

Imagine that our hash map has a capacity of 16, i.e., we have 16 buckets indexed from 0 to 15. However, as we just saw, hashing can produce very large integers, which makes it impossible to use the hash directly as a bucket index. To keep things small, let’s say our object’s hash is 62. To map 62 to one of our 16 buckets, we need an additional operation, called masking. A simple mask is the modulo operation with the capacity, which always produces a value between 0 and 15:

62 % 16
14

In other words, 62 would map to bucket number 14. However, there are faster masks. Specifically, since the capacity is a power of two, Python can use the bitwise AND (&) with capacity - 1, which gives the same result:

# bin(62) = 0b111110
#               &
# bin(15) = 0b001111
# ------------------
#           0b001110 = 14
62 & 15
14

Note that since the number of buckets is usually much smaller than the hash, & effectively keeps only the lowest bits of the hash.
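
As a quick sanity check, we can verify that the two masks agree for any hash whenever the capacity is a power of two:

capacity = 16
all((h % capacity) == (h & (capacity - 1)) for h in range(10_000))  # True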

Storing data into a bucket: handling collisions

If the number of empty buckets is large enough, a new key can be mapped to an empty bucket with high probability. (Empty buckets contain a NULL value.) In that case, we will simply store the key-value pair in the bucket.

However, it is possible that the bucket is already occupied. In that case, Python first checks whether the stored key and the new key are equal. That is why 1. we store the key object in the bucket; and 2. the key object needs to implement .__eq__(). If there is a match, the key-value pair gets updated. Otherwise, we have a collision, i.e., two different keys map to the same bucket. In that case, a deterministic exploration of other buckets starts. The details of this process, called probing, are beyond the scope of this article. Once an empty bucket is found, the key-value pair is stored there. At lookup time, the same probing sequence is followed until either the right key or an empty bucket is found.

When a value is deleted, we cannot simply overwrite it with a NULL. That would make the bucket identical to a “virgin” bucket, and potentially disrupt the probing strategy, leading to inconsistent results. Instead, a special value is written (sometimes called a “turd”). Nonetheless, this memory is not wasted: another key-value pair can take its place if needed, without compromising the integrity of the data.

Note: User-defined classes are hashable by default, even if they implement neither __hash__ nor __eq__. The default hash is derived from the object’s id(), and distinct instances compare unequal. Conversely, a class that defines __eq__ without defining __hash__ becomes unhashable.
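
A quick illustration, using a made-up Point class:

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p, q = Point(1, 2), Point(1, 2)
hash(p) == hash(q)  # False: the default hash is derived from id()
p == q              # False: the default equality is identity

# defining __eq__ without __hash__ makes a class unhashable
class ComparablePoint(Point):
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

hash(ComparablePoint(1, 2))  # TypeError: unhashable type: 'ComparablePoint'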

Time complexity

Understanding what underlies dictionaries and sets allows us to estimate the time complexity of the different operations.

Lookup: given a key, in the vast majority of cases a lookup is done in \(O(1)\). The actual time depends on how fast the hash function is. Collisions are the main hurdle: in the worst case, all keys collide, and we have to iterate over all the elements to find ours, making the complexity \(O(n)\). Luckily, a good hash function makes collisions very rare.

Insertion: for similar reasons, the amortized time complexity is \(O(1)\) and the worst case is \(O(n)\).

Deletion: for similar reasons, the amortized time complexity is \(O(1)\) and the worst case is \(O(n)\).

Resizing: CPython doubles the capacity of a dictionary when it becomes 2/3 full, and quadruples the capacity of a set under the same condition. When this happens, all key-value pairs need to be relocated to their new buckets. This is a fairly expensive step, albeit an infrequent one, which keeps the amortized insertion complexity at \(O(1)\).
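
We can observe resizing indirectly by watching a dictionary’s memory footprint grow in jumps as we insert items (the exact sizes and thresholds vary across Python versions and platforms):

import sys

d = {}
last = sys.getsizeof(d)
for i in range(100):
    d[i] = i
    size = sys.getsizeof(d)
    if size != last:
        print(f"resized at {len(d)} items: {last} -> {size} bytes")
        last = size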

A dictionary implementation

Putting it all together, this is a very rough implementation of a dictionary:

from collections.abc import Hashable
from typing import Any

class Dictionary:
    """
    A dictionary implementation using linear probing for collision resolution.
    """

    def __init__(self, capacity: int = 1024) -> None:
        """
        Initialize the Dictionary with a given capacity.

        Parameters:
        - capacity (int): The initial capacity of the dictionary;
          must be a power of two for the bitwise mask to work.

        Returns:
        - None
        """

        # virgin buckets are set to None, deleted buckets to False
        self.__buckets = [None for _ in range(capacity)]  
        self.__size = capacity
        self.__n_items = 0

    def __setitem__(self, key: Hashable, value: Any) -> None:
        """
        Set the value for a given key in the dictionary.

        Parameters:
        - key (Hashable): The key to be inserted.
        - value (Any): The value associated with the key.

        Returns:
        - None
        """
        idx = self.__find_key_bucket(key)
        # only count new keys, not updates of existing ones
        if not isinstance(self.__buckets[idx], tuple):
            self.__n_items += 1
        self.__buckets[idx] = (key, value)
        self.__resize_check()

    def __getitem__(self, key: Hashable) -> Any:
        """
        Get the value associated with a given key from the dictionary.

        Parameters:
        - key (Hashable): The key whose value needs to be retrieved.

        Returns:
        - Any: The value associated with the given key.

        Raises:
        - KeyError: If the key is not found in the dictionary.
        """
        idx = self.__find_key_bucket(key)
        bucket = self.__buckets[idx]
        if not isinstance(bucket, tuple):
            raise KeyError(key)
        return bucket[1]

    def __contains__(self, key: Hashable) -> bool:
        """
        Check if the dictionary contains a given key.

        Parameters:
        - key (Hashable): The key to check for existence in the dictionary.

        Returns:
        - bool: True if the key is present in the dictionary, False otherwise.
        """
        try:
            idx = self.__find_key_bucket(key)
        except KeyError:
            return False

        return bool(self.__buckets[idx])


    def __delitem__(self, key: Hashable) -> None:
        """
        Delete the entry for a given key from the dictionary.

        Parameters:
        - key (Hashable): The key to be deleted.

        Returns:
        - None

        Raises:
        - KeyError: If the key is not found in the dictionary.
        """
        idx = self.__find_key_bucket(key)
        if not isinstance(self.__buckets[idx], tuple):
            raise KeyError(key)
        # mark as deleted with False (a "turd"), not None,
        # to avoid breaking probing sequences
        self.__buckets[idx] = False
        self.__n_items -= 1

    def __resize_check(self) -> None:
        """
        Check if resizing of the dictionary is necessary based on the load factor (2/3).

        Returns:
        - None
        """
        if self.__n_items < (self.__size * 2 / 3):
            return

        new_size = 4 * self.__size  # grow aggressively to keep resizes rare
        dummy_dict = Dictionary(new_size)

        # reinsert all existing key-value pairs
        for bucket in self.__buckets:
            if not bucket:
                continue
            key, value = bucket
            dummy_dict[key] = value

        self.__size = new_size
        self.__buckets = dummy_dict.__buckets
        self.__n_items = dummy_dict.__n_items

    def __find_key_bucket(self, key: Hashable) -> int:
        """
        Find the index of the bucket corresponding to a given key.

        Parameters:
        - key (Hashable): The key to be found.

        Returns:
        - int: The index of the bucket that contains the key, or of the
          bucket where it should be inserted.

        Raises:
        - KeyError: If the key is absent and the table is full.
        """
        idx = hash(key) & (self.__size - 1)
        first_deleted = None  # first reusable (deleted) bucket seen
        n_iters = 0

        # probe until we find the key or a never-used bucket (None)
        while self.__buckets[idx] is not None:

            # stop once we have checked all buckets
            if n_iters >= self.__size:
                if first_deleted is not None:
                    return first_deleted
                raise KeyError(key)

            bucket = self.__buckets[idx]
            if bucket is False:
                # deleted bucket: remember it, but keep probing, since
                # the key may be stored further along the probe sequence
                if first_deleted is None:
                    first_deleted = idx
            elif bucket[0] == key:
                return idx

            idx = (idx + 1) % self.__size
            n_iters += 1

        # reached an empty bucket: reuse a deleted one if available
        return first_deleted if first_deleted is not None else idx

Let’s see it in action:

# under each command, I show the internal bucket state

my_dict = Dictionary(4)
# [None, None, None, None]

my_dict[1] = "a"
# [None, (1, 'a'), None, None]

my_dict[2] = "b"
# [None, (1, 'a'), (2, 'b'), None]

# the dictionary is now 75% full, which triggers a resize
my_dict[5] = "c"
# [None, (1, 'a'), (2, 'b'), None, None, (5, 'c'), None,
#  None, None, None, None, None, None, None, None, None]

my_dict[140] = "d"
# [None, (1, 'a'), (2, 'b'), None, None, (5, 'c'), None,
#  None, None, None, None, None, (140, 'd'), None, None, None]

my_dict[1] = 2
# [None, (1, 2), (2, 'b'), None, None, (5, 'c'), None, None,
#  None, None, None, None, (140, 'd'), None, None, None]

1 in my_dict # True

del my_dict[1]
# [None, False, (2, 'b'), None, None, (5, 'c'), None, None,
#  None, None, None, None, (140, 'd'), None, None, None]

1 in my_dict # False

my_dict[1] = "x"
# [None, (1, 'x'), (2, 'b'), None, None, (5, 'c'), None,
#  None, None, None, None, None, (140, 'd'), None, None, None]

my_dict[1] # x
my_dict[2] # b

Starred expressions

Dictionaries support two unpacking (“star”) operators: * and **. Let’s see how they work:

ingredients = {
    "carrots": 3,
    "tomatoes": 2,
    "lettuces": 1,
}

print({*ingredients})
{'tomatoes', 'carrots', 'lettuces'}

* unpacked the keys, which went into a new set (note that sets do not preserve insertion order).

print({**ingredients})
{'carrots': 3, 'tomatoes': 2, 'lettuces': 1}

** unpacked the key-value pairs, which went into a new dictionary.
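
Beyond literals, ** is also how a dictionary is forwarded as keyword arguments to a function. For instance, with a made-up stock function:

def stock(carrots=0, tomatoes=0, lettuces=0):
    return f"{carrots} carrots, {tomatoes} tomatoes, {lettuces} lettuces"

stock(**ingredients)
'3 carrots, 2 tomatoes, 1 lettuces'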

Insertion order is preserved

Since Python 3.6, Python dictionaries preserve insertion order (an implementation detail of CPython 3.6 that became a language guarantee in Python 3.7), i.e., the items are iterated over in the same order in which they were inserted into the dictionary:

ingredients = {
    "carrots": 3,
    "tomatoes": 2,
    "lettuces": 1,
}
print(ingredients)
{'carrots': 3, 'tomatoes': 2, 'lettuces': 1}
ingredients = {
    "lettuces": 1,
    "tomatoes": 2,
    "carrots": 3,
}
print(ingredients)
{'lettuces': 1, 'tomatoes': 2, 'carrots': 3}
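
Note that insertion order affects printing and iteration, but not equality: two dictionaries with the same key-value pairs compare equal regardless of order:

print(ingredients == {"carrots": 3, "tomatoes": 2, "lettuces": 1})
True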

Merging two dictionaries

Since Python 3.9, there are three ways of merging two dictionaries. Two of them produce equivalent results: unpacking with the ** operator (available since Python 3.5) and the merge operator | (added in Python 3.9):

dairy_1 = {
    "cheese": 5,
    "yogurt": 4
}

dairy_2 = {
    "cheese": 3,
    "paneer": 2
}
{**dairy_1, **dairy_2}
{'cheese': 3, 'yogurt': 4, 'paneer': 2}
dairy_1 | dairy_2
{'cheese': 3, 'yogurt': 4, 'paneer': 2}

These options create a new dictionary containing all the key-value pairs. As one might expect, the key insertion order is preserved from left to right. Note that when a key appears in both dictionaries, the rightmost value wins.

An alternative is dict.update(), which merges the dictionaries in place, updating the values of shared keys:

dairy_1.update(dairy_2)
print(f"dairy_1 = {dairy_1}")
print(f"dairy_2 = {dairy_2}")
dairy_1 = {'cheese': 3, 'yogurt': 4, 'paneer': 2}
dairy_2 = {'cheese': 3, 'paneer': 2}

This is more memory efficient, since it does not create a new dictionary. However, it is not desirable if we want to keep the original dictionaries.
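
For completeness, Python 3.9 also introduced the in-place merge operator |=, which behaves like dict.update():

dairy_1 |= dairy_2  # equivalent to dairy_1.update(dairy_2)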

Handling default values

dict.setdefault to set and fetch values

The dict.setdefault method is useful to assign a value to a key if and only if the key is missing:

ingredients = {
    "carrots": 3,
    "tomatoes": 2,
    "lettuces": 1,
}

ingredients.setdefault("carrots", 0)
ingredients.setdefault("pineapples", 0)

print(f"Number of carrots: {ingredients['carrots']}")
print(f"Number of pineapples: {ingredients['pineapples']}")
Number of carrots: 3
Number of pineapples: 0

Despite its name, dict.setdefault also fetches the value (either the preexisting one or the newly inserted one):

carrots = ingredients.setdefault("carrots", 0)

print(f"Number of carrots: {carrots}")
Number of carrots: 3
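
This dual behavior makes setdefault handy for grouping patterns, where a mutable default is created the first time a key is seen. A minimal sketch with made-up purchase data:

purchases = [("carrots", 3), ("tomatoes", 2), ("carrots", 1)]

grouped = {}
for ingredient, amount in purchases:
    grouped.setdefault(ingredient, []).append(amount)

print(grouped)
{'carrots': [3, 1], 'tomatoes': [2]}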

Use defaultdict when there is a single default value

The collections.defaultdict goes one step further. It is a good replacement for a dictionary when there is a single default value. Its first argument is a callable, the default factory, which returns the default value; it is called if and only if a missing key is accessed:

from collections import defaultdict

ingredients_dd = defaultdict(lambda: 0)  # defaultdict(int) would be equivalent

for ingredient, amount in ingredients.items():
    ingredients_dd[ingredient] = amount

print(ingredients_dd)
print(f"Number of cabbages: {ingredients_dd['cabbages']}")
print(ingredients_dd)
defaultdict(<function <lambda> at 0x1014eb6d0>, {'carrots': 3, 'tomatoes': 2, 'lettuces': 1, 'pineapples': 0})
Number of cabbages: 0
defaultdict(<function <lambda> at 0x1014eb6d0>, {'carrots': 3, 'tomatoes': 2, 'lettuces': 1, 'pineapples': 0, 'cabbages': 0})

Note that ingredients_dd now contains an item for cabbages, which was never explicitly inserted. defaultdict not only allows us to write simpler code, but is also more efficient than setdefault at avoiding unnecessary calls to the default factory. For instance, ingredients.setdefault("carrots", set()) would instantiate a new set even if the key carrots already exists; defaultdict avoids that call.
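
One subtlety worth knowing: the default factory is only triggered by indexing ([]), not by .get():

from collections import defaultdict

dd = defaultdict(int)
dd.get("missing")  # None: .get() does not call the factory
dd["missing"]      # 0: indexing calls the factory and inserts the key
print(dd)
defaultdict(<class 'int'>, {'missing': 0})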

Use collections.Counter to count

The collections.Counter is a type of dictionary specialized in counting objects, i.e., the values are integers. It can be initialized from an existing dictionary:

from collections import Counter

ingredients_counter = Counter(ingredients)
print(ingredients_counter)
Counter({'carrots': 3, 'tomatoes': 2, 'lettuces': 1, 'pineapples': 0})

By default missing keys have a value of 0, but they are not inserted:

print(f"Number of cabbages: {ingredients_counter['cabbage']}")
print(ingredients_counter)
Number of cabbages: 0
Counter({'carrots': 3, 'tomatoes': 2, 'lettuces': 1, 'pineapples': 0})

Counters extend dictionaries in interesting ways. For instance, they make it easy to find the elements with the most counts:

ingredients_counter.most_common(1)
[('carrots', 3)]
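
Counters can also be built directly from any iterable, and they support arithmetic between counters:

Counter(["carrots", "carrots", "tomatoes"])
Counter({'carrots': 2, 'tomatoes': 1})

Counter({"carrots": 2}) + Counter({"carrots": 1, "tomatoes": 2})
Counter({'carrots': 3, 'tomatoes': 2})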

Composite keys

Since tuples (of hashable elements) are themselves hashable, they can be used as keys:

# store ingredients and purchase date
ingredients = {
    ("carrots",  "2024-01-04"): 3,
    ("tomatoes", "2024-01-13"): 2,
    ("carrots", "2024-01-13"): 1,
}

Composite keys can then be looked up directly; inside the square brackets, the parentheses around the tuple are optional:

ingredients["carrots", "2024-01-13"]
1
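
When iterating, the composite key can be unpacked in place:

for (ingredient, date), amount in ingredients.items():
    print(f"{amount} {ingredient} bought on {date}")
3 carrots bought on 2024-01-04
2 tomatoes bought on 2024-01-13
1 carrots bought on 2024-01-13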

Use zip to create dictionaries from lists

When we have two lists of the same length, we can quickly combine them using zip:

ingredient_list = ["carrots", "tomatoes", "lettuces"]
counts = [3, 2, 1]
ingredients = dict(zip(ingredient_list, counts))
print(ingredients)
{'carrots': 3, 'tomatoes': 2, 'lettuces': 1}
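
One caveat: zip silently truncates to the shortest input. Since Python 3.10, passing strict=True turns a length mismatch into an error:

dict(zip(["carrots", "tomatoes"], [3, 2, 1]))
{'carrots': 3, 'tomatoes': 2}

dict(zip(["carrots", "tomatoes"], [3, 2, 1], strict=True))
ValueError: zip() argument 2 is longer than argument 1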

Further reading