BigData: Bloom filter

Bloom filters are probabilistic data structures that are commonly used in big data analysis to efficiently test whether an element is a member of a set. They are particularly useful when dealing with large sets of data, here are the general steps to create a Bloom filter in big data analysis:

Determine the size and number of hash functions: The size of the Bloom filter should be proportional to the number of elements in the set you are testing for membership. The number of hash functions should be determined based on the acceptable false positive rate, which is the probability that a non-member is incorrectly identified as a member.
Initialize the Bloom filter: Create an array of bits of the appropriate size and initialize all bits to 0.
Add elements to the Bloom filter: For each element in the set, apply the hash functions to the element to generate a set of hash values. Set the corresponding bits in the Bloom filter to 1 for each hash value.
Test for membership: To test whether an element is a member of the set, apply the same hash functions to the element and check if all corresponding bits in the Bloom filter are set to 1. If any of the bits are 0, the element is not a member. If all of the bits are 1, the element may be a member with a certain probability.
Optimize the Bloom filter: Optimize the Bloom filter by considering the size of the set, the desired false positive rate, and the available memory. Adjust the size and number of hash functions to achieve the desired trade-off between accuracy and memory usage.

Overall, creating a Bloom filter in big data analysis involves determining the appropriate size and number of hash functions, initializing the Bloom filter, adding elements to the filter, testing for membership, and optimizing the filter for accuracy and memory usage. Bloom filters are a powerful tool for efficiently testing membership in large sets of data in big data analysis.

Here's an example of how a Bloom filter can be used to check for membership in a set:

Suppose we have a set of strings {apple, banana, cherry, durian} and we want to create a Bloom filter to test for membership in this set. We'll use a bit array of size 16 and two hash functions.

First, we initialize the bit array to all 0s:

|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|

To add the string "banana" to the set, we apply our hash functions to the string and set the corresponding bits in the bit array:

hash function 1("banana") = 1

hash function 2("banana") = 12

So we set bits 1 and 12 to 1:

|0|1|0|0|0|0|0|0|0|0|0|1|0|0|0|0|

To test if the string "apple" is in the set, we apply our hash functions to the string and check the corresponding bits in the bit array:

hash function 1("apple") = 0

hash function 2("apple") = 9

Bits 0 and 9 are both 0, so we can safely say that "apple" is not in the set.

To test if the string "cherry" is in the set, we apply our hash functions to the string and check the corresponding bits in the bit array:

hash function 1("cherry") = 11

hash function 2("cherry") = 8

Bits 11 and 8 are both 0, so we can safely say that "cherry" is not in the set.

To test if the string "durian" is in the set, we apply our hash functions to the string and check the corresponding bits in the bit array:

hash function 1("durian") = 3

hash function 2("durian") = 12

Bits 3 and 12 are both 1, so we conclude that "durian" is likely in the set (with a small probability of a false positive).

Note :

For hash function 1, we can take the hash of a string by summing the ASCII values of its characters and then taking the result modulo the size of the bit array (16 in this case). Here's how we can calculate hash function 1("banana"):

hash function 1("banana")

= (98 + 97 + 110 + 97 + 110 + 97) % 16

= 609 % 16

= 1

For hash function 2, we can take the hash of a string by multiplying each ASCII value by a different prime number, summing the results, and then taking the result modulo the size of the bit array. Here's how we can calculate hash function 2("banana"):

hash function 2("banana")

= ((98 * 2) + (97 * 3) + (110 * 5) + (97 * 7) + (110 * 11) + (97 * 13)) % 16

= 196+291+550+679+1210+1261 % 16

= 4187 % 16

= 12

In practice, more sophisticated hash functions are typically used to improve the quality of the hash and reduce the probability of collisions.

Example :

BigData

Pages

Sunday, 2 April 2023

Bloom filter

No comments:

Post a Comment

Friends-of-friends-Map Reduce program

Report Abuse

Labels