Backward Support Computation Method for Positive and Negative Frequent Itemset Mining ()
1. Introduction
Association rules mining have been studied successfully and extensively in many application domains such as market analysis and telecommunications [1] [2] [3] [4]. As digitalized data are increasing due to human dependency on technologies, the world has an abundance of data which if processed properly can turn into one of the most valuable resources of information and knowledge. Specific methods and algorithms are required to extract knowledge from data, which is simply known as data mining. Along with classification and clustering, frequent pattern mining is one of the major sub-fields of data mining and knowledge discovery. Purpose of frequent pattern mining is to obtain itemsets from data based on common associations. It can be useful to find patterns that frequently appear in data or to discover items often existing together. Applications of frequent pattern mining range from supermarket sales analysis to bioinformatics.
Algorithms like Apriori [5], FP-Growth [6] can mine what is referred to in the literature as positive association rules (PAR). To mine positive associations from a data set the items with strong positive correlations are considered. However, itemsets with strong negative correlation called negative association rule (NAR) may also be as informative as positive ones. For example, from supermarket sales data NAR mining may reveal products scarcely bought together. In behavior analysis lack of specific behaviors may help understand human psychology.
Formally a positive association rule is of the form
where X and Y are separate entities. The rule
can be crudely translated into if X exists in an itemset and Y also exists in it. For NARs, either one or both of the related entities are negated. i)
, ii)
or iii)
are the possible negative associations, whereas rules like
says if X does not exist in an itemset and Y exists in it. An example can be Pepsi
ØCoke, because people who buy one brand of cold-drinks from a store are less likely to buy another along with it.
Negative relations among items were first mentioned in [7], but the first significant work on it was [8] which combined previously discovered positive associations with domain knowledge to reduce the search space. Reducing the search space enabled it to discover fewer but more interesting negative associations. There have been few other notable works in the domain. In [9] authors defined four types of association rules from Piatetsky-Shapiro’s argument and used statistical dependency and conditional probability to discover association rules. In 2006, an algorithm named PNAR based on the Apriori method was proposed [10]. The algorithm exploits the upward closure property of negative association rules to find all valid positive-negative associations in the support-confidence framework.
The same year, authors in [11] presented an approach for mining positive and negative association rules which used chi-squared test and multiple confidences. Combining the Multiple Level Minimum Support (MLMS) model with a new measure called Valid Association Rules based on Correlation-coefficient and Confidence (VARCC) which combines correlation coefficient and minimum confidence a new algorithm (PNAR_MLMS) was proposed in [12]. Authors in [13] proposed an algorithm using correlation coefficient with support and confidence to mine both PFIS and NFIS. The coefficients need continuous updating while processing.
Because both positive and negative association rules are relevant, many algorithms focus on finding both simultaneously. The task of discovering (negative or positive) association rules can be divided into two parts. The first one is to mine frequent itemsets like A, B, C or A, B, ØC, the latter is to generate association rules from them. The process of generating association rules mostly depends on the discovery of frequent itemsets and generally follows the same method everywhere. Frequent itemset extraction is what usually defines the difference in performance between two algorithms. However, the hidden nature of negative itemsets makes the mining task very complex. Authors often define their own definitions and parameters in their work. Besides, negative frequent itemsets (NFIS) are numerous in number. Even for a considerably small data set the search space becomes very large. Most NAR mining algorithms are therefore time-consuming.
This work proposes EBC-NFIS based on e-NFIS algorithm [14] that can discover NFIS from pre-calculated positive frequent itemsets (PFIS). It is clear that each NFIS is nothing but a recombination of a PFIS where one or more items are negated. e-NFIS first generates all possible negative itemsets from the permutation of positive itemsets. Then, it uses the inclusion-exclusion principles to calculate itemset support. This work attempts to speed up the second step of e-NFIS with an aim to reduce overall computation time. EBC-NFIS processes negative candidate itemsets in a systematic order to calculate supports of current step based on supports calculated in previous steps. This reduces large computation time of e-NFIS algorithm and determines itemset support in constant runtime instead of exponential. The experimental results demonstrate that EBC-NFIS performs significantly faster than e-NFIS.
The rest of the paper is organized as follows. Section 2 presents problem definition and theories related to this work. Section 3 and Section 4 present the proposed EBC-NFIS algorithm and experimental results respectively. Finally Section 5 concludes this paper with some future works.
2. Preliminaries
In this section, we first define the problem of backward support computation for EBC-NFIS. Then, we present generation of PFIS and NCIS along with working principle of conventional e-NFIS algorithm.
2.1. Problem Definition
Let
be a set of n distinct terms called positive items,
be a set of I’s corresponding non-occurring items called negative items. Each positive item has its own negative counterpart. Similarly, each negative item has its positive counterpart. Here,
is called the positive partner of
, and
is called the negative partner of I.
Let
be a transactional database where each transaction t contains items from the set of positive items I such that t is a subset of I i.e.
. Each transaction is associated with a unique identifier, called
. A non-empty subset of
is called an itemset. The number of items in an itemset X is the length of the itemset, denoted by
. An itemset with length k is presented as k-itemset. We define
as the number of transactions containing itemset X in transactional database, D. An association rule for itemsets X and Y is defined as
where
and
. X and Y are called antecedent and consequent of the association rule respectively.
Association rules have some standard measures that present quality and significance of the association. In this work, we employed two popular measures namely support and confidence.
● Support: The most popular framework for eliminating non-interesting itemsets is support calculation. The support of an itemset can be thought of as the probability of it being in a randomly chosen transaction. Mathematically the support of an item
can be written as,
(1)
Here
is the size of the transactional database D. In the support framework an itemset is considered to be interesting if its support is above a given threshold value called minimum support which is expressed as
in this work. While mining frequent itemsets, an itemset
will be eliminated if
.
● Confidence: The confidence of an association rule is an indication of how often the rule has been found to be true. Confidence represents the percentage of transactions in database D containing X that also contain Y and denoted as follows:
(2)
The confidence value indicates how reliable the rule is. The higher the confidence value, the more likely the rules occur in a database if it is known that all rules are contained in that database. An association rule is considered to be interesting when its confidence value is greater than user defined minimum confidence. The minimum confidence is represented as
throughout this paper.
Association rules having the support value greater than user defined minimum support,
, and confidence greater than user defined minimum confidence,
are called valid association rules. Therefore, by a valid association rule, we mean any expression of the form
where,
,
,
,
and
1)
2)
2.2. Generation of Positive Frequent Itemsets (PFIS)
PFIS can be derived using an algorithm like Apriori [5] or FP-Growth [6]. Apriori is a level based algorithm which first generates candidate itemsets and prunes non-interesting ones. FP-growth replaces the data set with a tree like data structure called FP-Growth tree and skips the candidate generation part entirely. Unlike Apriori, FP-Growth algorithm does not scan the whole data set multiple times and therefore performs significantly better for large data sets. Other notable positive frequent pattern mining algorithms include rapid association rule mining (RARM) [15] and ECLAT [16]. Like FP-Growth, RARM uses a novel tree data structure and can accelerate the mining process even for low minimum support. ECLAT uses a vertical database format. In this work, we employed Apriori algorithm to derive FPIS from transactional database.
2.3. Generation of Negative Candidate Itemset (NCIS)
NCIS can be generated from FPIS by permutation and negation of items. For example, given a positive itemset
the following negative itemsets can be produced:
●
●
●
●
●
●
●
The conventional e-NFIS algorithm ignores itemsets where each member is negative. This work however considers even them as interesting enough to discover.
2.4. Support Calculation for NCIS
In this work, we employed support calculation method presented in [17] for negative itemsets. The basic idea is inclusion-exclusion principle of set theory.
Let, I be the set of all n items in a transactional database D. Suppose, we have a candidate negative itemset,
where
and
. We can also write
and
. Therefore,
and its support can be calculated as below:
(3)
For example, the support of an itemset
can be calculated as,
2.5. Working Principle of e-NFIS
The e-NFIS algorithm is widely used in negative itemset mining that generates NFIS from PFIS instead of scanning whole database [14]. The algorithm works in three steps:
1) Step1: Mine PFIS using traditional PFIS algorithms, such as Apriori or FP-Growth.
2) Step2: Generate NCIS from PFIS derived in Step1. Generation of NCIS from PFIS is described in 2.3.
3) Step3: Calculate support for NCIS. Supports of NCIS can be calculated using inclusion-exclusion principle as described in 2.4.
In should be noted that as NCIS is generated from FPIS, any subset of PFIS used in the support calculation must also be frequent and therefore will exist among the discovered itemsets. We employed e-NFIS algorithm with backward calculation of NCIS to derive NFIS in this work.
3. Proposed Backward Support Computation Algorithm
In this section, we describe proposed method of support computation for NCIS using backward calculations. Here, we define an itemset m-negative itemset if it contains exactly m number of negative items. For example,
is a 2-negative itemset. Computation of NCIS support increases with increasing value of m i.e. starting from 1-negative itemsets, then 2-negative itemsets and so on. Let, a candidate negative itemset
. An itemset
can be generated by excluding the last element
. The support of newly generated
can be computed if the support of W is known as follows:
(4)
For example, the support of an itemset
can be calculated as:
(5)
All items from the power set of
are required in computation. As
and
, the time complexity of the support calculation step of e-NFIS is
where q is the number of negative items in itemset NIC. Sacrificing space for time, the proposed EBC-NFIS can perform this step in constant or O(1) time because addition of only two values is required. Achieving from exponential to constant runtime is a huge improvement in performance. As the size of the data set increases naturally, the number of total items should also increase. Therefore, conventional e-NFIS runs significantly slow for those cases, where the performance of EBC-NFIS will not deteriorate. The proposed EBC-NFIS algorithm is summarized in Algorithm 1.
Algorithm 1. Proposed backward support calculation for EBC-NFIS.
4. Experimental Results
In this section, we present experimental results of EBC-NFIS algorithm for different datasets. First, we briefly describe hardware and software platforms and give a brief description of standard datasets used for the experiments. Then, we present experimental results and evaluate efficiency of proposed EBC-NFIS algorithm by comparing with e-NFIS algorithme [14].
4.1. System and Data Set Description
All the experiments have been performed on a system with software summarized in Table 1. In this work, we used three standard datasets namely Breast Cancer dataset, Mushroom dataset and Congressional voting 84. The datasets are collected from [18] that is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. Table 2 summarizes the number of transactions and number of items of both the datasets.
Figure 1. Positive and negative association rules for different
and
for breast cancer dataset.
4.2. Result for Different Minimum Supports and Confidence
We conducted this experiment to find the number of positive and negative association rules (PNARs) for different minimum support,
and confidence,
using proposed EBC-NFIS algorithm. The experiment results based on the data sets of breast cancer, mushroom and congressional voting 84 are shown in Figures 1-3 respectively. The experimental results indicate that the number of PNARs found by proposed algorithm increases for all the datasets when minimum support,
decreases. Similarly, the number of PNARs increases when minimum confidence,
decreases for all the datasets which is a fair characteristics of association rules searching algorithm.
4.3. Comparison between e-NFIS and EBC-NFIS
In this section, we evaluated proposed EBC-NFIS algorithm by comparing running time with conventional e-NFIS algorithm. We run both e-NFIS and EBC-NFIS on both breast cancer and mushroom data set to observe the difference in runtime for both the algorithms. As e-NFIS algorithm finds only NFIS, we consider running time only for finding NFIS for both the algorithms for a fair comparison. The results have been displayed in Figures 4-6.
The running time shows that the proposed backward calculation method requires significantly less running time than that of e-NFIS. As minimum support decreases, the difference becomes more prominent. It is evident from the result that proposed method can discover NFIS far more efficiently than e-NFIS algorithm and consumes much less time. The difference in performance becomes even more noticeable when the number of possible frequent patterns increases. Bigger amount of patterns is usually associated with more distinct items and therefore number of negative items in an itemset.
Figure 2. Positive and negative association rules for different
and
for mushroom dataset.
Figure 3. Positive and negative association rules for different
and
for congressional voting 84 dataset.
Figure 4. Comparison of runtime between e-NFIS and EBC-NFIS for different minimum supports for breast cancer dataset.
Figure 5. Comparison of runtime between e-NFIS and proposed EBC-NFIS for different minimum supports for mushroom dataset.
Figure 6. Comparison of runtime between e-NFIS and proposed EBC-NFIS for different minimum supports for congressional voting 84 dataset.
5. Conclusion
In this paper, we proposed an algorithm EBC-NFIS to discover PNARs and NFIS from positive ones. The algorithm is based on popular algorithm called e-NFIS. The proposed EBC-NFIS algorithm mines positive frequent itemsets using FP-Growth algorithm and generates candidate negative itemsets. Candidate negative itemsets are filtered based on a support threshold. We utilize calculated values of previous step to speed up calculation of current step. Experimental results showed that EBC-NFIS calculates support exponentially faster than e-NFIS which validate efficiency of our proposal. However, EBC-NFIS consumes higher memory compared to e-NFIS. Reduction in memory consumption will be our future research.