In Business Intelligence (and in data mining in general) a regular need is to be able to find the items that frequently go together in a consumer basket.
A common itemset problem is to mine a set of items to find a subset of those which have a close connection between them. A good example are the bottles of ketchup that are often located beside spaghetti shelves in some grocery stores. Those items appear together frequently in a consumer basket. Any set of items that regularly appear together in many baskets is said to be frequent.
When mining those data, we could use this algorithm to define and then recommend items of purchase for a buyer (consider some popups that appear after you complete an online purchase). Knowing that A and B are a frequent itemset, why not recommend item B to a user who buys item A. With a bit of data mining and further application of these frequent data sets in marketing, the total revenue has increased up to 30%.
The challenge in this problem is the amount of data to be processed in memory and engines load. Having a set of N items per basket, there would be N!/2!(N-2)! pair combinations of items to check. First it would have to store all these combinations for all baskets and second to iterate through all of them to find the frequent pairs.
The Apriori algorithm helps
The Apriori algorithm is that “if a pair of items is to be frequent, each individual item should also be frequent“. Following the Apriori algorithm, the number of sets will be greatly reduced resulting in the reduction of computation resources as well.
Solution to find frequent sets
To establish a “threshold” we must define what is or is not frequent in the algorithm. Having an item appear more than “threshold”= X times in baskets, defines it as frequent.
Apriori algorithm implementation
- Pass over each item in each basket, and calculate its frequency (count how many time each one appears). This can be done with a hash, where the position of the hash, refers to the frequency ‘Y’. If an item has a frequency greater than X, it is considered to be frequent. Y (position) > X (threshold).
- Process in order to iterate only through the individually frequent items, computing the frequency of pairs of those items. So if item 1 and item 2 are frequent in themselves, we then compute the frequency of the pair. Once this is calculated, the frequencies greater than the threshold are said to be a frequent itemset.
Having used the Apriori algorithm, even if 50% of the items below threshold are sifted out, the computation load decreases as well as the memory consumption by a factor of 4.