The problem of finding frequent itemsets in data analysis is described in this post, and here I lay out the practical steps for finding the frequent itemsets through MapReduce.
Let’s divide the file in which we want to find frequent itemsets into equal-sized chunks. Instead of using the entire file (data items grouped into baskets), we could pick a random subset of the baskets and treat it as if it were the entire dataset. The risk in taking the sample from just one portion of a large file is that the data may not be uniformly distributed there, so we select the baskets from the whole file at random.
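As a rough illustration, here is a minimal Python sketch of drawing such a random sample; the one-basket-per-line file format and the `sample_baskets` name are assumptions made for the example, not part of the original description.

```python
import random

def sample_baskets(path, fraction, seed=42):
    """Read baskets (one basket per line, items separated by whitespace)
    and keep each basket with probability `fraction`, so the sample is
    drawn uniformly from the whole file rather than from one portion."""
    rng = random.Random(seed)
    sample = []
    with open(path) as f:
        for line in f:
            if rng.random() < fraction:
                sample.append(frozenset(line.split()))
    return sample
```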
Given:
- Each chunk is fraction ‘p’ of the whole input file (total 1/p chunks)
- ‘s’ is the support threshold for the algorithm
- ‘ps’ (that is, p*s) is the lowered support threshold we use when processing a single chunk, treated as a sample of the whole.
After all the chunks have been processed in that way (for example, with the Apriori algorithm), collect all the itemsets that have been found frequent in one or more chunks; these are the candidates. Note that an itemset that is not frequent in any chunk appears fewer than ‘ps’ times in each of the 1/p chunks, and therefore fewer than ‘s’ times in the whole file. Thus, every itemset that is frequent in the whole file is frequent in at least one chunk, and we can be sure that all the truly frequent itemsets are among the candidates.
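The following is a minimal, sequential sketch of this first pass, assuming each basket is a set of items; the `frequent_itemsets` helper is only a brute-force stand-in for a real Apriori implementation, and both function names are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, threshold, max_size=2):
    """Brute-force stand-in for Apriori: count every itemset of up to
    `max_size` items and keep those occurring in at least `threshold` baskets."""
    counts = Counter()
    for basket in baskets:
        items = sorted(basket)
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {itemset for itemset, count in counts.items() if count >= threshold}

def candidate_itemsets(chunks, p, s):
    """First pass: process each chunk at the lowered threshold p*s and
    take the union of the per-chunk frequent itemsets as the candidates."""
    candidates = set()
    for chunk in chunks:
        candidates |= frequent_itemsets(chunk, p * s)
    return candidates
```

For instance, with s = 1,000,000 and p = 0.01, each chunk would be tested against a per-chunk threshold of ps = 10,000 (these numbers are only illustrative).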
This algorithm fits naturally into a parallel-computing environment such as MapReduce. Each of the chunks can be processed in parallel, and the frequent itemsets from each chunk are combined to form the candidates. We can distribute the candidates to many processors, have each processor count the support for each candidate in a subset of the baskets, and finally sum those supports to get the support for each candidate itemset in the whole dataset. This process does not have to be implemented in MapReduce, but there is a natural way of expressing each of the two passes as a MapReduce operation.
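To make the two passes concrete without a cluster, the sketches below assume a toy, purely in-memory stand-in for a MapReduce run; the `run_mapreduce` helper is an assumption for illustration, not a real framework API.

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Toy in-memory MapReduce: apply `map_fn` to every input, group the
    emitted (key, value) pairs by key (the shuffle step), then apply
    `reduce_fn` to each key with its list of values."""
    grouped = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            grouped[key].append(value)
    results = []
    for key, values in grouped.items():
        results.extend(reduce_fn(key, values))
    return results
```

The first pass runs this over the chunks, and the second pass runs it over portions of the baskets paired with the candidate set produced by the first pass.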
Summary of the MapReduce implementation
First Map Function: Take the assigned subset of the baskets and find the itemsets frequent in the subset using the algorithm described above. As described there, lower the support threshold from ‘s’ to ‘ps’ if each Map task gets fraction ‘p’ of the total input file. The output is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample. The value is always 1 and is irrelevant.
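A sketch of this first Map function, assuming `find_frequent(baskets, threshold)` is whatever frequent-itemset algorithm is run on a chunk (for instance Apriori, or the brute-force helper sketched above):

```python
def first_map(chunk, p, s, find_frequent):
    """First Map task: find the itemsets frequent in this chunk at the
    lowered threshold p*s and emit each one as a key with dummy value 1."""
    for itemset in find_frequent(chunk, p * s):
        yield (itemset, 1)
```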
First Reduce Function: Each Reduce task is assigned a set of keys, which are itemsets. The value is ignored, and the Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the output of the first Reduce function is the candidate itemsets.
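The corresponding Reduce function amounts to deduplication; a minimal sketch:

```python
def first_reduce(itemset, values):
    """First Reduce task: the values are ignored; every itemset emitted by
    at least one Map task is produced once, as a candidate."""
    yield itemset
```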
Second Map Function: The Map tasks for the second Map function take all the output from the first Reduce Function (the candidate itemsets) and a portion of the input data file. Each Map task counts the number of occurrences of each of the candidate itemsets among the baskets in the portion of the dataset that it was assigned. The output is a set of key-value pairs (C, v), where ‘C’ is one of the candidate sets and ‘v’ is the support for that itemset among the baskets that were input to this Map task.
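A sketch of the second Map function, assuming each task receives the full candidate set alongside its portion of the baskets:

```python
def second_map(baskets, candidates):
    """Second Map task: for each candidate itemset C, count the baskets in
    this portion that contain it and emit (C, v), the local support."""
    for candidate in candidates:
        v = sum(1 for basket in baskets if candidate <= frozenset(basket))
        yield (candidate, v)
```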
Second Reduce Function: The Reduce tasks take the itemsets they are given as keys and sum up the associated values. The result is the total support for each of the itemsets that the Reduce task was assigned to handle. Those itemsets whose sum of values is at least ‘s’ are frequent in the whole dataset, so the Reduce task outputs these itemsets with their counts. Itemsets that do not have total support at least ‘s’ are not transmitted to the output of the Reduce task.
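And a sketch of the second Reduce function:

```python
def second_reduce(itemset, partial_supports, s):
    """Second Reduce task: sum the partial supports from the Map tasks and
    output the itemset with its total count only if the total is at least s."""
    total = sum(partial_supports)
    if total >= s:
        yield (itemset, total)
```

With the toy `run_mapreduce` helper above and `functools.partial` to bind p, s, the candidates, and the itemset finder, the two passes chain together: the output of the first pass becomes the candidate set consumed by the second.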
The whole process is described here, in Section 6.4.