Introduction to Bloom filter

Standard

Bloom fil­ter was intro­duced by Bur­ton H. Bloom in his paper “Space/Time Trade-offs in Hash Cod­ing with Allow­able Errors” in 1970. Nowa­days it is one of the most well-known prob­a­bilis­tic data struc­tures. Bloom fil­ter pro­vides a space-effi­cient way to prob­a­bilis­ti­cal­ly check a mem­ber­ship of an ele­ment in a set.

Let’s say we need to check if 1 is in a set (1, 2, 3). If we would store 1, 2 and 3 in an array, the oper­a­tion of check­ing a mem­ber­ship would be easy and straight­for­ward: iter­ate over the array and see if the ele­ment in hand is equal to 1. Sim­ple, right? How­ev­er, what hap­pens if we would need to see if a book exists in a set of 10,000 books. If a book is, let’s say, 1MB, then the array that would store all the books would be 10,000 MB or 10 GB. If we would like to per­form the lookup on a machine that has 8GB of RAM then the machine would run out of mem­o­ry. There would be 2 options: either upgrad­ing the machine or find­ing smarter ways to solve the prob­lem. Pro­vid­ed that we are hap­py with some false pos­i­tives — i.e. sit­u­a­tions where an ele­ment is report­ed to be in a set while it actu­al­ly isn’t — then Bloom fil­ter is a per­fect data struc­ture for this kind of prob­lems.

Basics

Bloom fil­ter is defined by two inte­gers: m and kindi­cates the length of the bit array used to store the infor­ma­tion about the pres­ence of all the ele­ments. k indi­cates the num­ber of hash func­tions used to add a sin­gle ele­ment to data struc­ture. Essen­tial­ly the k hash func­tions are used to deter­mine which k bits in a bit array of length m are set to 1. When adding an ele­ment to an array we just spec­i­fy the index i of the new ele­ment and the add oper­a­tion just mod­i­fies the cor­re­spond­ing mem­o­ry address at index i. The add oper­a­tion of Bloom fil­ter on the oth­er hand iter­ates over the k hash func­tions where every hash func­tion returns an index i that is less than m. After obtain­ing the k index­es, all the bits in k loca­tions are set to 1. So instead of actu­al­ly stor­ing the ele­ments them­selves, Bloom fil­ter just stores m bits.

In order to test the pres­ence of an ele­ment in a set, as men­tioned before, in the case of an array it would mean iter­at­ing over all the ele­ments of an array an see­ing if the cur­rent ele­ment in hand is equal to the ele­ment. In case of a Bloom fil­ter the check for a pres­ence of an ele­ment means iter­at­ing over the hash func­tions that return k index­es that all are less than m and then see­ing if all the k bits in the bit array are set to one. If any of the bits in the array are 0, then it is pos­si­ble to con­clude with 100% con­fi­dence that the ele­ment is not present in the set.

The prob­a­bil­i­ty side of the data struc­ture comes to play when Bloom fil­ter tells that a spe­cif­ic ele­ment is in the set. Let’s say that we added num­ber 2 to the set which meant that bits at loca­tions 3, 5 and 6 were set to 1. If we end up choos­ing the hash func­tions such that hash­ing any oth­er ele­ment, that’s not in the set, returns exact­ly the same index­es 3, 5 and 6 then we end up in a sit­u­a­tion where Bloom fil­ter reports that an ele­ment exists in the set even though it does not in real­i­ty. This means that we encoun­tered a false pos­i­tive. Even though false pos­i­tives are not a hap­py sce­nario, it is pos­si­ble to esti­mate the prob­a­bil­i­ty of a them.

Estimating the probability of false positives

As men­tioned before, Bloom fil­ter might intro­duce some false pos­i­tives for the sake of opti­miza­tion which is not an ide­al case. How­ev­er, it is pos­si­ble to rea­son about the prob­a­bil­i­ty using the fol­low­ing equa­tion:

    \[(1-(1-\dfrac{1}{m})^{kn})^k\approx(1-e^{\dfrac{kn}{m}}))^k\]

 n in the equa­tion indi­cates the num­ber of ele­ments insert­ed into the fil­ter. The equa­tion essen­tial­ly tells us that by increas­ing the num­ber of hash func­tions or the length of the bit array or by reduc­ing the num­ber of ele­ments added to the fil­ter we can reduce the prob­a­bil­i­ty of encoun­ter­ing a false pos­i­tive.

Implementation

Below is a poten­tial imple­men­ta­tion of a Bloom fil­ter that ini­tial­izes a bit vec­tor of length and with dif­fer­ent seeds for the hash func­tion. The oper­a­tions of adding and check­ing if an ele­ment is in the set means iter­at­ing over k seeds and cal­cu­lat­ing an cor­re­spond­ing k index­es using the hash func­tion and the seeds. Append­ing an means flip­ping the bits in k dif­fer­ent index­es, check­ing the pres­ence means check­ing if all the bits at dif­fer­ent index­es are set to true.

Introduction to Content Security Policy

Standard

Con­tent secu­ri­ty pol­i­cy (CSP) is an extra lay­er of secu­ri­ty that helps to pro­tect web­sites from cross-site script­ing (XSS). The core idea of CSP is to enforce a set of rules described either in the head­er of a HTTP response or in the HTML head ele­ment. Using CSP in com­bi­na­tion with tra­di­tion­al meth­ods such as input san­itza­tion and out­put encod­ing is a great way to enhance the secu­ri­ty of a web­site. The stan­dard is gov­erned by W3 Web Appli­ca­tion Secu­ri­ty Work­ing Group and the ver­sion 2 is cur­rent­ly the most wide­spread with a large major­i­ty of browsers sup­port­ing it1.

There are two ways to enable CSP: either via intro­duct­ing HTTP response head­er Con­tent-Secu­ri­ty-Pol­i­cy that defines the pol­i­cy or describ­ing it in the meta tag of a HTML doc­u­ment. In case of using HTML tags, the tag has to be placed as ear­ly to the doc­u­ment as pos­si­ble because all the resources before the tag will be loaded and parsed with­out any restric­tions and there’s a pos­si­bil­i­ty that one of these resources might tam­per the pol­i­cy described in the tag. If both the response head­er and the meta tag are described, then the stricter pol­i­cy is tak­en into account. In gen­er­al it seems to be more rea­son­able to describe CSP rules in the HTTP head­ers as it is more straight­for­ward to use and offers report-only func­tion­al­i­ty described lat­er.
A poten­tial CSP rule set could look like this:

The pol­i­cy con­sists of rules sep­a­rat­ed by semi­colons where each rule describes the tar­get (whether it applies to scripts, images, stylesheets etc.) and the per­mit­ted sources. For exam­ple, the rule below per­mits to load scripts only from the domain where the web­site is host­ed.

The table below describes the most impor­tant tar­gets of CSP.

Tar­getExpla­na­tion
block-all-mixed-con­tentblocks load­ing any assets over HTTP when the page is loaded via HTTPS
con­nect-srcrestricts URLs which can be loaded in scripts. For exam­ple, dis­ables ille­gal XHRs.
default-srcA default pol­i­cy used if a direc­tive is miss­ing. Should be used always.
font-srcSpec­i­fies cor­rect sources for fonts
script-srcSpec­i­fies cor­rect sources for scripts
report-uriURI to send reports of vio­la­tion of a pol­i­cy
style-srcSpec­i­fies cor­rect sources for styles

RuleDescrip­tion
‘self’Match­es the cur­rent domain
‘unsafe-inline’Allows exe­cut­ing inline CSS and JS
https://example.comAllows resources only from example.com over HTTPS
https://*.example.comAllows resources from all sub­do­mains of example.com
https://example.com:*Allows resources only from example.com over HTTPS using any port

As it can be seen from the table above, CSP allows using wild­card syn­tax, which enables to cre­ate fine-grained set­up in order to avoid acci­den­tal secu­ri­ty holes. As always with secu­ri­ty, it is wise to be as restric­tive as pos­si­ble.

Inline scripts

CSP is meant to pro­tect a site from adver­saries, while per­mit­ting the real author as much free­dom as pos­si­ble. A quite com­mon sce­nario is that an own­er of a web­site wants to add an inline script like Google Ana­lyt­ics, but block every­one else from inject­ing inline resources. How to achieve it con­sid­er­ing that not describ­ing rule ‘unsafe-inline’ blocks even non-mali­cious inline scripts? CSP spec­i­fi­ca­tion allows the usage of nonce, where the pol­i­cy spec­i­fies an allowed nonce and the per­mit­ted inline resource tag has to have an attribute called “nonce” with exact­ly the same val­ue.
For exam­ple, spec­i­fy­ing the fol­low­ing pol­i­cy allows inject­ing scripts with a nonce val­ue of EDNnf03nceIOfn39fn3e9h3sdfa.

A pos­si­ble inline script that is per­mit­ted to exe­cute could be like this:

As the attack­er does not know the nonce (because nonce, by def­i­n­i­tion, can be used only once and has to be gen­er­at­ed on every page load), then the mali­cious inline resources won’t be exe­cut­ed by the brows­er.

Reporting

CSP has a  unique fea­ture which per­mits send­ing vio­la­tion reports to a  URL in case of an attempt to load a for­bid­den resource. In order to spec­i­fy the URL, CSP sep­ci­fies direc­tive report-uri fol­lowed by the URL.

The reports are sent via POST request, where the request body could be some­thing like this:

In addi­tion to block­ing and report­ing, CSP also pro­vides a report-only mode where noth­ing is blocked, but all the banned sources are report­ed. This is a per­fect way to test CSP before actu­al­ly start­ing to block ille­gal sources or even use it as hon­ey­pot to learn the behav­ior of mali­cious users. Adding HTTP head­er Con­tent-Secu­ri­ty-Pol­i­cy-Report-Only to the response enables the report-only mode. Both the Con­tent-Secu­ri­ty-Pol­i­cy and Con­tent-Secu­ri­ty-Pol­i­cy-Report-Only can be used at the same time. How­ev­er, it is not pos­si­ble to enable report-only mode using HTML meta tags.

Conclusion

CSP is a great addi­tion­al lay­er of secu­ri­ty that enables to whitelist the sources of assets that are allowed to exe­cute. Pro­vid­ed that the rules are prop­er­ly con­fig­ured, then the attack­ers should have hard time inject­ing their own sources to the site. As always, it is bet­ter to be safe than sor­ry and con­fig­ur­ing CSP as strict­ly as pos­si­ble is a good idea. In addi­tion, even though CSP helps us to secure web­sites, just rely­ing only on it is too risky and the oth­er stan­dard meth­ods like san­i­tiz­ing input and escap­ing out­put should also be used.