Bloom filter was introduced by Burton H. Bloom in his paper “Space/Time Trade-offs in Hash Coding with Allowable Errors” in 1970. Nowadays it is one of the most well-known probabilistic data structures. Bloom filter provides a space-efficient way to probabilistically check a membership of an element in a set.

Let’s say we need to check if 1 is in a set (1, 2, 3). If we would store 1, 2 and 3 in an array, the operation of checking a membership would be easy and straightforward: iterate over the array and see if the element in hand is equal to 1. Simple, right? However, what happens if we would need to see if a book exists in a set of 10,000 books. If a book is, let’s say, 1MB, then the array that would store all the books would be 10,000 MB or 10 GB. If we would like to perform the lookup on a machine that has 8GB of RAM then the machine would run out of memory. There would be 2 options: either upgrading the machine or finding smarter ways to solve the problem. Provided that we are happy with some false positives — i.e. situations where an element is reported to be in a set while it actually isn’t — then Bloom filter is a perfect data structure for this kind of problems.

## Basics

Bloom filter is defined by two integers: **m** and **k**. **m **indicates the length of the bit array used to store the information about the presence of all the elements. **k** indicates the number of hash functions used to add a single element to data structure. Essentially the **k** hash functions are used to determine which **k** bits in a bit array of length m are set to 1. When adding an element to an array we just specify the index **i** of the new element and the add operation just modifies the corresponding memory address at index **i**. The add operation of Bloom filter on the other hand iterates over the **k** hash functions where every hash function returns an index i that is less than m. After obtaining the **k** indexes, all the bits in **k** locations are set to 1. So instead of actually storing the elements themselves, Bloom filter just stores **m** bits.

In order to test the presence of an element in a set, as mentioned before, in the case of an array it would mean iterating over all the elements of an array an seeing if the current element in hand is equal to the element. In case of a Bloom filter the check for a presence of an element means iterating over the **k **hash functions that return **k** indexes that all are less than **m** and then seeing if all the **k** bits in the bit array are set to one. If any of the bits in the array are 0, then it is possible to conclude with 100% confidence that the element is not present in the set.

The probability side of the data structure comes to play when Bloom filter tells that a specific element is in the set. Let’s say that we added number 2 to the set which meant that bits at locations 3, 5 and 6 were set to 1. If we end up choosing the **k **hash functions such that hashing any other element, that’s not in the set, returns exactly the same indexes 3, 5 and 6 then we end up in a situation where Bloom filter reports that an element exists in the set even though it does not in reality. This means that we encountered a false positive. Even though false positives are not a happy scenario, it is possible to estimate the probability of a them.

## Estimating the probability of false positives

As mentioned before, Bloom filter might introduce some false positives for the sake of optimization which is not an ideal case. However, it is possible to reason about the probability using the following equation:

**n** in the equation indicates the number of elements inserted into the filter. The equation essentially tells us that by increasing the number of hash functions or the length of the bit array or by reducing the number of elements added to the filter we can reduce the probability of encountering a false positive.

## Implementation

Below is a potential implementation of a Bloom filter that initializes a bit vector of length **m **and with **k **different seeds for the hash function. The operations of adding and checking if an element is in the set means iterating over** k **seeds and calculating an corresponding **k** indexes** **using the hash function and the seeds. Appending an means flipping the bits in **k** different indexes, checking the presence means checking if all the bits at **k **different indexes are set to true.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
import hashlib from bitarray import bitarray import random class BloomFilter: def __init__(self, k, m): self._seeds = [random.randint(0,10000) for x in range(k)] self._bit_array = m * bitarray(‘0’) def add(self, element): for seed in self._seeds: index = self._hash(element, seed) self._bit_array[index] = 1 def _hash(self, element, seed): hashed_object = hashlib.md5(‘{}{}’.format(element, seed)) return int(hashed_object.hexdigest()[:8],16) % len(self._bit_array) def exists(self, element): for seed in self._seeds: index = self._hash(element, seed) if self._bit_array[index] == 0: return False return True |