In this paper, ML-CLUBAS (Multi Label-Classification of software Bugs Using Bug Attribute Similarity), a multi-label variant of the CLUBAS [1] algorithm, is presented. CLUBAS is a hybrid algorithm designed using text clustering, frequent term calculation and taxonomic term mapping techniques, and is an example of classification using clustering. CLUBAS is a single-label algorithm, in which each bug cluster is mapped to exactly one bug category. However, a bug cluster can be mapped to more than one bug category when its cluster label matches the terms of more than one category; ML-CLUBAS is presented in this work for this purpose. The designed algorithm is evaluated using the performance parameters F-measure, accuracy, number of clusters and purity. These parameters are compared with those of CLUBAS and of other multi-label text clustering algorithms.
Classification algorithms in text mining can be categorized by the class labels assigned to each document. On this basis there are two major categories of classification algorithms: single-label algorithms and multi-label algorithms. In multi-label categorization, a text document may belong to one or more categories. In single-label classification, each document is mapped to exactly one category. Software bugs contain most of their important information as text. The algorithm CLUBAS (CLassification of software Bugs Using Bug Attribute Similarity) is presented in [
In this section, the pseudo-code and working of the ML-CLUBAS algorithm are presented. Like CLUBAS, ML-CLUBAS is divided into five major steps. All steps are the same as in CLUBAS except step 4 (Mapping Clusters to Classes), which transforms the CLUBAS algorithm into ML-CLUBAS. Like CLUBAS, it takes two parameters for performing bug classification: the textual similarity threshold value (t) and the number of frequent terms in a cluster label (N). The initial step is Extract Data, in which the bug records from a particular bug repository are retrieved and stored on the local system. The next step is Pre-processing, in which the software bug records available locally in HTML or XML file formats are parsed, and the bug attributes and their corresponding values are stored in the local database. After this, stop word elimination and stemming are performed on the textual bug attributes summary (title) and description, which are used for creating the bug clusters. In the following step (Clustering), the pre-processed software bug attributes are selected for textual similarity measurement. Cosine similarity technique from symmetric [
The next step (Cluster Label Generation) generates the cluster labels from the frequent terms present in the bugs of a cluster. In this step, the summaries (titles) and descriptions of all the software bugs belonging to a particular cluster are aggregated, the frequent terms in this aggregate text are calculated, and the N most frequent terms (where N is a user-supplied parameter giving the number of frequent terms in a label) are assigned to the cluster as its label. Next, the cluster labels are mapped to the bug categories using the taxonomic terms of the various categories (Mapping Clusters to Classes). In this step, the taxonomic terms for all the bug categories are pre-identified and the cluster label terms are matched against them. A match between terms indicates that the cluster belongs to the corresponding category. In ML-CLUBAS, one bug cluster can belong to more than one bug category, depending on the matches between taxonomic terms and cluster label terms: every match assigns the bug cluster to the matching category. The last step (Performance Evaluation and Output Representation) generates the confusion matrix, from which performance parameters such as precision, recall and accuracy are calculated. Precision and recall can be combined to calculate the F-measure; the formulas for these parameters are mentioned in the next section. Finally, the cluster information is visualized and presented as the output of ML-CLUBAS.
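The label generation and multi-label mapping steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy taxonomy and the sample bug texts are hypothetical, tokenization is a plain whitespace split on already pre-processed text, and ties between equally frequent terms are broken alphabetically as an assumption. The F-measure helper uses the standard definitions of precision and recall.

```python
from collections import Counter


def top_n_terms(documents, n):
    """Return the n most frequent terms in the aggregated text of one cluster."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    # Descending frequency, then alphabetical order (an assumed tie-break).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def map_cluster_to_categories(label_terms, taxonomy):
    """Multi-label mapping: assign every category that shares a term with the label."""
    label_set = set(label_terms)
    return sorted(cat for cat, terms in taxonomy.items() if label_set & set(terms))


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical taxonomic terms per bug category (illustration only).
TAXONOMY = {
    "gui": {"button", "screen", "layout"},
    "network": {"wifi", "socket", "timeout"},
    "security": {"password", "encrypt"},
}

# Hypothetical pre-processed summaries of the bugs in one cluster.
cluster_docs = [
    "wifi button crash screen",
    "wifi timeout screen freeze",
    "screen wifi reconnect",
]

labels = top_n_terms(cluster_docs, 3)                     # N = 3
categories = map_cluster_to_categories(labels, TAXONOMY)
print(labels, categories)
```

In this toy example the cluster label contains terms from two categories, so the cluster is assigned to both; a single-label scheme would have to pick only one of them.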
The accuracy and performance of prediction models for classification problems are typically evaluated using a confusion matrix. Various performance measures, such as accuracy and F-measure, are derived from the confusion matrix. The formulas for these parameters are covered in the CLUBAS [
Entropy is the amount of information by which the knowledge about the classes increases when the number of clusters is increased. Entropy tends to increase with the number of clusters and reaches a maximum of log2(N), where N is the number of clusters. Entropy is a measure of the uncertainty associated with a random variable. A lower value of entropy indicates better cluster quality. In an ideal situation, if each software bug has a one-to-one mapping with a cluster, the value of entropy is zero [3,4]. Entropy is defined as follows:
$$E = -\sum_{i=1}^{N} P_i \log_2 P_i$$
where P_i is the probability of a document being in the ith cluster.
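As a worked illustration, the following Python sketch computes this measure under two assumptions that are not stated in the text: entropy takes the standard Shannon form E = -sum_i P_i log2(P_i), and P_i is estimated as the fraction of documents assigned to cluster i. The function name and the sample assignments are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(assignments):
    """Shannon entropy (in bits) of a list of cluster ids, one id per document."""
    total = len(assignments)
    if total == 0:
        return 0.0
    counts = Counter(assignments)
    # p * log2(1 / p) equals -p * log2(p); this form avoids returning -0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# All documents in one cluster: no uncertainty.
print(cluster_entropy([0, 0, 0, 0]))
# Documents spread evenly over N = 4 clusters: the maximum, log2(4).
print(cluster_entropy([0, 1, 2, 3]))
```

The two calls show the extremes mentioned above: a single cluster gives zero entropy, and an even spread over N clusters gives log2(N).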
The implementation is done in the open-source object-oriented programming language Java, with MySQL as the local database management system, Weka [
Random software bug records are selected from four open-source online software bug repositories, namely Android [
After the software bug records are extracted and made available on the local system, these records are pre-processed. The pre-processing takes place in three stages: parsing, elimination of stop words and stemming [
The categorical terms are generated from the software bug cluster labels. The technique for generating these taxonomic terms from various bug repositories is given in [
ML-CLUBAS is first compared with CLUBAS using the performance parameters accuracy and F-measure. Since CLUBAS and ML-CLUBAS are identical up to the cluster generation stage, both algorithms generate the same number of bug clusters. ML-CLUBAS is further compared with the multi-label text clustering algorithms Lingo and STC using the parameters accuracy, F-measure, number of clusters and entropy. Lingo was proposed by Osinski et al. [10,11] for clustering search results; it uses algebraic transformations of the term-document matrix and frequent phrase extraction using suffix arrays. Lingo is a popular algorithm and is commonly used for clustering web search results. Grouper [12,13] is a snippet-based clustering engine whose main feature is the introduction of a phrase-analysis algorithm called STC (Suffix Tree Clustering). The STC algorithm groups the input texts according to the identical phrases they share.
The accuracy results of the algorithm for the various repositories are shown in
the multi label text clustering algorithm is shown in figure. Accuracy drops as the number of classes increases. In terms of accuracy, ML-CLUBAS performs much better than both Lingo and STC for all the bug repositories used in the experiment except JBoss-Seam. For the JBoss-Seam repository, STC gives higher accuracy than ML-CLUBAS and Lingo. The reason was investigated by manual inspection of the repository, which showed that JBoss-Seam bug records contain less textual information (less text in the textual bug attributes) than those of the other repositories.
The relationship between the F-measure and the number of classes is plotted in
The number of clusters generated for the different bug repositories is depicted in
increases with the number of software bugs. This is because, as the number of bugs increases, more bugs are found that do not fall into any of the existing clusters; in other words, dissimilar bugs enter the system and cause new bug clusters to form.
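The growth in the number of clusters can be illustrated with a small Python sketch of threshold-based clustering. The exact clustering procedure of CLUBAS is not given in this section, so the greedy scheme below is an assumption: each bug joins the first cluster whose first member has cosine similarity of at least t with it, and otherwise starts a new cluster. Term-frequency vectors over whitespace tokens stand in for the pre-processed bug attributes, and the function names and sample bug texts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def threshold_cluster(texts, t):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []  # each cluster is a list of indices into texts
    for i, text in enumerate(texts):
        for cluster in clusters:
            if cosine_similarity(text, texts[cluster[0]]) >= t:
                cluster.append(i)
                break
        else:
            # No existing cluster is similar enough: a dissimilar bug opens a new one.
            clusters.append([i])
    return clusters


# Hypothetical pre-processed bug summaries.
bugs = [
    "app crash on login screen",
    "crash on login screen after update",
    "wifi drops after sleep",
    "battery drains fast overnight",
]

print(threshold_cluster(bugs[:2], 0.5))  # two similar bugs share one cluster
print(threshold_cluster(bugs, 0.5))      # dissimilar bugs add new clusters
```

With the first two bugs only, a single cluster is formed; adding the two dissimilar bugs raises the cluster count to three, mirroring the behaviour described above.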
Both CLUBAS and ML-CLUBAS create the same number of bug clusters, since the two algorithms work identically up to the bug clustering step; their implementations differ only after the bug clusters are created. Lingo creates the maximum number of clusters (with fewer bugs per cluster) for all the repositories compared with ML-CLUBAS and STC. STC always creates a small, fixed number of clusters because it follows a tree-based structure to generate them: up to 2000 bug samples it generates 16 (2^4) clusters. Lingo generates more clusters than STC, but fewer than the CLUBAS algorithm for the same number of bugs. Lingo creates clusters by identifying key phrases in the text, whereas CLUBAS generates clusters using textual similarity information in the text collection of software bug attributes. The reason for the smaller number of clusters in Lingo and STC is that more software bugs are left out of the clusters and treated as outliers in Lingo and STC, whereas fewer bugs are identified as outliers in CLUBAS: around 7% of bugs are identified as outliers in CLUBAS, whereas around 13% to 15% of bugs are identified as outliers in Lingo and STC.
The graph plotted for the corresponding entropy values in
Creating the software bug classification with the text clustering algorithms (Lingo and STC) takes only five seconds for up to 1000 software bug records, whereas ML-CLUBAS takes more computation time than Lingo and STC: around 20 seconds for up to 1000 software bugs on the same machine using the same bug records. This is because measuring pair-wise attribute similarity and then applying clustering and label generation requires many calculations, and therefore more time than Lingo and STC. For 1000 bugs the maximum time taken is about 3.5 seconds, and the maximum time taken for 100 bugs is about 1.2 seconds. The experiments are performed on a machine with a 2.0 GHz CPU and 2 GB of RAM (Random Access Memory).
The limitation of this work is the same as that of CLUBAS [
CLUBAS is a single-label classification algorithm, in which each bug cluster belongs to a single bug category. In this work, ML-CLUBAS, a multi-label variant of CLUBAS in which a single bug cluster can be mapped to more than one bug category, is presented with pseudo-code. A comparison of ML-CLUBAS with CLUBAS and with the text clustering algorithms Lingo and STC is also presented. The results show that, because bug clusters are mapped to more than one category in ML-CLUBAS, the TN (True Negative) and FP (False Positive) counts are higher, and hence accuracy and F-measure are lower than for CLUBAS. The comparison with Lingo and STC shows that ML-CLUBAS performs better in terms of accuracy. In terms of cluster entropy STC is the best algorithm; however, ML-CLUBAS gives acceptable entropy values.