Introduction to Bloom filter

Read­ing Time: 3 min­utes

Bloom fil­ter was intro­duced by Bur­ton H. Bloom in his paper “Space/Time Trade-offs in Hash Cod­ing with Allow­able Errors” in 1970. Nowa­days it is one of the most well-known prob­a­bilis­tic data struc­tures. Bloom fil­ter pro­vides a space-effi­cient way to prob­a­bilis­ti­cal­ly check a mem­ber­ship of an ele­ment in a set.

Let’s say we need to check if 1 is in a set (1, 2, 3). If we would store 1, 2 and 3 in an array, the oper­a­tion of check­ing a mem­ber­ship would be easy and straight­for­ward: iter­ate over the array and see if the ele­ment in hand is equal to 1. Sim­ple, right? How­ev­er, what hap­pens if we would need to see if a book exists in a set of 10,000 books. If a book is, let’s say, 1MB, then the array that would store all the books would be 10,000 MB or 10 GB. If we would like to per­form the lookup on a machine that has 8GB of RAM then the machine would run out of mem­o­ry. There would be 2 options: either upgrad­ing the machine or find­ing smarter ways to solve the prob­lem. Pro­vid­ed that we are hap­py with some false pos­i­tives — i.e. sit­u­a­tions where an ele­ment is report­ed to be in a set while it actu­al­ly isn’t — then Bloom fil­ter is a per­fect data struc­ture for this kind of prob­lems.

Basics

Bloom fil­ter is defined by two inte­gers: m and kindi­cates the length of the bit array used to store the infor­ma­tion about the pres­ence of all the ele­ments. k indi­cates the num­ber of hash func­tions used to add a sin­gle ele­ment to data struc­ture. Essen­tial­ly the k hash func­tions are used to deter­mine which k bits in a bit array of length m are set to 1. When adding an ele­ment to an array we just spec­i­fy the index i of the new ele­ment and the add oper­a­tion just mod­i­fies the cor­re­spond­ing mem­o­ry address at index i. The add oper­a­tion of Bloom fil­ter on the oth­er hand iter­ates over the k hash func­tions where every hash func­tion returns an index i that is less than m. After obtain­ing the k index­es, all the bits in k loca­tions are set to 1. So instead of actu­al­ly stor­ing the ele­ments them­selves, Bloom fil­ter just stores m bits.

In order to test the pres­ence of an ele­ment in a set, as men­tioned before, in the case of an array it would mean iter­at­ing over all the ele­ments of an array an see­ing if the cur­rent ele­ment in hand is equal to the ele­ment. In case of a Bloom fil­ter the check for a pres­ence of an ele­ment means iter­at­ing over the hash func­tions that return k index­es that all are less than m and then see­ing if all the k bits in the bit array are set to one. If any of the bits in the array are 0, then it is pos­si­ble to con­clude with 100% con­fi­dence that the ele­ment is not present in the set.

The prob­a­bil­i­ty side of the data struc­ture comes to play when Bloom fil­ter tells that a spe­cif­ic ele­ment is in the set. Let’s say that we added num­ber 2 to the set which meant that bits at loca­tions 3, 5 and 6 were set to 1. If we end up choos­ing the hash func­tions such that hash­ing any oth­er ele­ment, that’s not in the set, returns exact­ly the same index­es 3, 5 and 6 then we end up in a sit­u­a­tion where Bloom fil­ter reports that an ele­ment exists in the set even though it does not in real­i­ty. This means that we encoun­tered a false pos­i­tive. Even though false pos­i­tives are not a hap­py sce­nario, it is pos­si­ble to esti­mate the prob­a­bil­i­ty of a them.

Estimating the probability of false positives

As men­tioned before, Bloom fil­ter might intro­duce some false pos­i­tives for the sake of opti­miza­tion which is not an ide­al case. How­ev­er, it is pos­si­ble to rea­son about the prob­a­bil­i­ty using the fol­low­ing equa­tion:

    \[(1-(1-\dfrac{1}{m})^{kn})^k\approx(1-e^{\dfrac{kn}{m}}))^k\]

 n in the equa­tion indi­cates the num­ber of ele­ments insert­ed into the fil­ter. The equa­tion essen­tial­ly tells us that by increas­ing the num­ber of hash func­tions or the length of the bit array or by reduc­ing the num­ber of ele­ments added to the fil­ter we can reduce the prob­a­bil­i­ty of encoun­ter­ing a false pos­i­tive.

Implementation

Below is a poten­tial imple­men­ta­tion of a Bloom fil­ter that ini­tial­izes a bit vec­tor of length and with dif­fer­ent seeds for the hash func­tion. The oper­a­tions of adding and check­ing if an ele­ment is in the set means iter­at­ing over k seeds and cal­cu­lat­ing an cor­re­spond­ing k index­es using the hash func­tion and the seeds. Append­ing an means flip­ping the bits in k dif­fer­ent index­es, check­ing the pres­ence means check­ing if all the bits at dif­fer­ent index­es are set to true.