The Magical Bloom Filter
What is Bloom Filter?
Bloom Filter checks if a key exists or not. It was invented by Burton Bloom (and hence the name) to solve dictionary problem. The Bloom Filter is used to check if the word is in dictionary or not. Computer memory is very limited at that time, and it is impossible to load all the words into memory.
The uniqueness of Bloom Filter is that it is not a very accurate filter. If it reports false (not exist), the items definitely is not exist. However if it reports true (exist), there are a small chance that it could be false positive result.
Comparison with normal filter
To illustrate the benefit of Bloom Filter, we use a normal filter to do comparison. These filter basically do two things, store keys, and check if particular string exists or not.
public interface IFilter { public void addData(String key); public boolean isExist(String key); }
Snippet 1: IFilter interface contains only 2 operations.
public class NormalFilter implements IFilter { private HashSet keySet = new HashSet(); @Override public void addData(String key) { keySet.add(key); } @Override public boolean isExist(String key) { return keySet.contains(key); } }
Snippet 2: Normal Filter store each key, thus it size will increase with the number of keys.
public class BloomFilter implements IFilter { private int bloomBitArray; @Override public void addData(String key) { int hashKey = hash(key); System.out.println("bloomBitArray (before):\t" + toBinaryString(bloomBitArray)); System.out.println("hashKey:\t\t" + toBinaryString(hashKey)); bloomBitArray = hashKey | bloomBitArray; System.out.println("bloomBitArray (after):\t" + toBinaryString(bloomBitArray)); } @Override public boolean isExist(String key) { int hashKey = hash(key); System.out.println("bloomBitArray:\t" + toBinaryString(bloomBitArray)); System.out.println("hashKey:\t" + toBinaryString(hashKey)); return (bloomBitArray & hashKey) == hashKey; } // it can be 1 or more hash functions private int hash(String key) { // you can use your own hash code, even as simple as modulo hash code, // I use Java built-in hashcode to illustrate the purpose. // set up 1 bit in the bloomBitArray return 1 << key.hashCode(); } // beautify the result private String toBinaryString(int key) { return padLeft(Integer.toBinaryString(key)); } // beautify the result private String padLeft(String binaryString) { return String.format("%32s", binaryString).replace(" ", "0"); } }
Snippet 3: Bloom Filter doesn’t store any keys. It only stores combined hash values.
public class Main { private static IFilter filter = new BloomFilter(); //private static IFilter filter = new NormalFilter(); /** * @param args */ public static void main(String[] args) { // add as much data as you want here here printAddData("hello"); printAddData("I"); printAddData("am"); printAddData("happy"); System.out.println("=================================================="); System.out.println(" RESULT "); System.out.println("=================================================="); // testing printIsExist("world"); printIsExist("wo"); printIsExist("w"); printIsExist("happy"); printIsExist("hap"); printIsExist("I"); } private static void printAddData(String key) { System.out.println("---------------------------------"); System.out.println("addData(" + key + ")"); filter.addData(key); } private static void printIsExist(String key) { System.out.println("---------------------------------"); System.out.println("isExist(" + key + "):\t" + filter.isExist(key)); } }
Results
--------------------------------- addData(hello) bloomBitArray (before): 00000000000000000000000000000000 hashKey: d 00000000000001000000000000000000 bloomBitArray (after): 00000000000001000000000000000000 --------------------------------- addData(I) bloomBitArray (before): 00000000000001000000000000000000 hashKey: d 00000000000000000000001000000000 bloomBitArray (after): 00000000000001000000001000000000 --------------------------------- addData(am) bloomBitArray (before): 00000000000001000000001000000000 hashKey: d 00000000000000000001000000000000 bloomBitArray (after): 00000000000001000001001000000000 --------------------------------- addData(happy) bloomBitArray (before): 00000000000001000001001000000000 hashKey: d 00000000000000000000000000000001 bloomBitArray (after): 00000000000001000001001000000001 ================================================== RESULT ================================================== --------------------------------- bloomBitArray: 00000000000001000001001000000001 hashKey: 00000000000001000000000000000000 isExist(world): true --------------------------------- bloomBitArray: 00000000000001000001001000000001 hashKey: 00000001000000000000000000000000 isExist(wo): false --------------------------------- bloomBitArray: 00000000000001000001001000000001 hashKey: 00000000100000000000000000000000 isExist(w): false --------------------------------- bloomBitArray: 00000000000001000001001000000001 hashKey: 00000000000000000000000000000001 isExist(happy): true --------------------------------- bloomBitArray: 00000000000001000001001000000001 hashKey: 00000000100000000000000000000000 isExist(hap): false --------------------------------- bloomBitArray: 00000000000001000001001000000001 hashKey: 00000000000000000000001000000000 isExist(I): true
Snippet 5: Bloom Filter Test Result
Note that the first result “world” is not correct. This is the “uniqueness” Bloom Filter (Yea, when we love something, we will call whatever defects as uniqueness :D).
So the “special” character of Bloom Filter is:
- When it say “wrong”, it is definitely wrong.
- When it say “right”, it could be right, or it could be wrong (false positive).
As the result, we can trust the result when it says “something doesn’t exist”. If it reports “yes, it exists”, we will have to check in other way to confirm the result. It seems like problematic, but in fact it is not, because Bloom Filter is normally used as an add-on to speed up query. It doesn’t stand alone.
Benefits
Size does matter
As you can see, Bloom Filter stores only hash values. Each has values will be “OR”ed with current bit array in Bloom Filter, as the result, the size of Bloom Filter remains constant, regardless the number of input.
Increase query speed
Several NoSQL datastores, namely HBase and Cassandra, have use it to increase query speed.
Assuming checking with Bloom Filter takes 1 ms, and querying with database takes 100 ms.
We have 3 cases here.
- Case 1: If Bloom Filter reports that the item doesn’t exist, the operation takes 1 ms.
- Case 2: If Bloom filter reports that the item exists, the request will be sent to database, and database will return the data, the operation takes 1 + 100 ms = 101 ms
- Case 3: If Bloom filter reports that the item exists, the request will be sent to database, and database will report not found, the operation takes 1 + 100 ms = 101 ms
Source code: blog-bloomfiter.zip