Bloom Filters- An Intro
Bloom Filters are quite simply put space-efficient, probabilistic data structures which help answer the question of set membership. From a mathematical lens say that you need to verify whether an element x belongs to a set A , if the Bloom Filter as a data structure returns true for x, what this means is that either x is definitely in the set A or it may not be in the set A. An implication that follow from this is that Bloom Filters are quite efficient at preprocessing and filtering out large lists, for further detailed set operations such as joins and intersections.
The next questions which arise are how can we create Bloom Filters ? What exactly is involved in Bloom Filter probabilistic data structures ?
For a start let us consider a set S such that:
Let B be a bit array of size m (m > 1), initialized with 0s. B’s elements are B[0], B[1], B[2], …, B[m–1]. The memory required for storing array B is only a fraction of that needed for storing the whole set S. By selecting the bigger bit vector (array B), it is possible to decrease the probability rate of false positives.
Let
- {H1, H2, …, Hk} be a set of k hash functions. If Hi(xj) = a, then set B[a] = 1. SHA1, MD5, and Murmur as hash functions can be used as hash functions.
• To check if x ε S, check B at Hi(x). All k values must be 1.
- It is possible to have a false positive; all k values are 1, but x is not in S.
Applications of Bloom Filters
Since Bloom Filters are a space efficient data structure they have a wide array of applications across various fields. For example, a Bloom Filter is used under the hood in the Apache Cassandra distributed database to facilitate read requests to the database. Some of the earliest applications of Bloom Filters include usage in the early development for spell checkers on a dictionary. Some of the more modern applications that they are used for include a variety of use cases varying from usage in network routers and applications to bioinformatics to blockchain implementations.