Software Defect Prediction Using Supervised Machine Learning and Ensemble Techniques: A Comparative Study



Introduction
A software defect is a bug, fault, or error in a program that causes improper outcomes. Software defects are programming errors that may occur because of errors in the source code, requirements, or design. Defects negatively affect software quality and software reliability [1]. Hence, they increase maintenance costs and the effort needed to resolve them. Software development teams can detect bugs by analyzing software testing results, but testing entire software modules is costly and time-consuming. As such, identifying defective modules in early stages is necessary to help software testers detect the modules that require intensive testing [2] [3].
In the field of software engineering, software defect prediction (SDP) in early stages is vital for software reliability and quality [1] [4]. The intention of SDP is to predict defects before software products are released, as detecting bugs after release is an exhausting and time-consuming process. In addition, SDP approaches have been demonstrated to improve software quality, as they help developers predict the most likely defective modules [5] [6]. SDP is considered a significant challenge, so various machine learning algorithms have been used to predict and determine defective modules [7]. To make software testing more effective, SDP is used to identify defective modules in current and subsequent versions of a software product. Therefore, SDP approaches are very helpful in allocating more effort and resources to testing and examining likely-defective modules [8].
Commonly used SDP strategies are regression and classification. The objective of regression techniques is to predict the number of software defects [5]. In the literature, a number of regression models have been used for SDP [9] [10] [11] [12]. In contrast, classification approaches aim to decide whether a software module is faulty or not. Classification models can be trained on the defect data of the previous version of the same software. The trained models can then be used to predict further potential software defects. Mining software repositories has become a vital research topic for predicting defects [13] [14].
Clustering algorithms such as k-means, x-means, and expectation maximization (EM) have also been applied to predict defects [33] [34] [35]. In addition, the experimental outcomes in [34] [35] showed that x-means clustering performed better than fuzzy clustering, EM, and k-means clustering at identifying software defects. Aside from those, transfer learning is a machine learning approach that aims to transfer the knowledge learned from one dataset and use it to help solve problems in a different dataset [36]. Transfer learning has also been introduced to the field of SDP [37] [38].
Software engineering data, such as defect prediction datasets, are very imbalanced: the number of samples of one class is vastly higher than that of another class. To deal with such data, imbalanced learning approaches have been proposed in SDP to mitigate the data imbalance problem [7]. Imbalanced learning approaches include re-sampling, cost-sensitive learning, ensemble learning, and imbalanced ensemble learning (hybrid approaches) [7] [39].
Section 2 summarizes software metrics that can be used as attributes to identify software defects. Section 3 presents evaluation metrics that can be used to measure the performance of SDP models. Sections 4 and 5 detail the experimental methodology and results, respectively. Section 6 presents the threats to validity. Related works are described in Section 7.

Software Metrics
A software metric is a measure of quantifiable or countable characteristics that can be used to assess and predict the quality of software. A metric is an indicator describing a specific feature of a software product [6]. Identifying and measuring software metrics is vital for various reasons, including estimating programming execution, measuring the effectiveness of software processes, estimating the effort required for processes, deducing defects during software development, and monitoring and controlling software project executions [5].
Various software metrics have been commonly used for defect prediction. The first group of software metrics is called lines of code (LOC) metrics and is con-

Evaluation Measures for Software Bugs Prediction
In this section, we will discuss different measurements for software defect pre-

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

G-measure is another measure used in software defect prediction. It is defined as the harmonic mean of recall and specificity. Probability of false alarm (PF) is the ratio of clean instances wrongly classified as defective (FP) to the total number of clean instances (FP + TN).
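The confusion-matrix measures used in this paper (accuracy, precision, recall, F-score, G-measure, and PF) can be illustrated with a minimal Python sketch. The function name `confusion_metrics` and the example counts are hypothetical and are not taken from the paper; specificity is computed with its standard definition, TN / (TN + FP), and the sketch assumes that no denominator is zero.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute defect-prediction measures from confusion-matrix counts.

    Assumes every denominator is non-zero (illustrative sketch only).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)        # correct among predicted defective
    recall = tp / (tp + fn)           # correct among actually defective
    pf = fp / (fp + tn)               # probability of false alarm
    specificity = tn / (tn + fp)      # equals 1 - pf
    f_score = 2 * precision * recall / (precision + recall)
    g_measure = 2 * recall * specificity / (recall + specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "pf": pf,
        "f_score": f_score,
        "g_measure": g_measure,
    }


# Hypothetical counts: 40 defective and 160 clean modules.
metrics = confusion_metrics(tp=30, tn=150, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.9 even though a quarter of the defective modules are missed, which is one reason accuracy alone can be misleading on imbalanced defect data.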

Experimental Methodology
For the experiments, 10 well-known software defect datasets [62] were selected. The majority of related works used these datasets to evaluate the performance of their SDP techniques, which is why they were selected for further comparisons. Table 1 reports the datasets used in the experiments along with their statistics. RF, DS, linear SVC (SVM), and LR were chosen as the base classifiers. Boosting and bagging classifiers for all the base classifiers were also considered. The experiments were conducted in a Python environment. The classifiers' performance in this study was measured using classification accuracy, precision, recall, F-score, and ROC-AUC score. It is important to highlight that these metrics were computed using the weighted average. The intuition behind selecting the weighted average was to calculate the metrics for each class label and take the label imbalance into account.
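To make the weighted average concrete, the following sketch computes a per-class F-score and then averages the classes weighted by their support (the number of true instances of each class). This mirrors the idea behind the `average="weighted"` option of scikit-learn's metric functions, although the paper does not state which implementation it used. The helper names and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_f_score(y_true, y_pred, label):
    """F-score obtained when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f_score(y_true, y_pred):
    """Average the per-class F-scores, weighting each class by its support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        per_class_f_score(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Toy, imbalanced example: 8 clean modules (0) and 2 defective modules (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

weighted = weighted_f_score(y_true, y_pred)
unweighted = sum(per_class_f_score(y_true, y_pred, c) for c in (0, 1)) / 2
print(f"weighted F-score:   {weighted:.4f}")
print(f"unweighted F-score: {unweighted:.4f}")
```

Here the clean class has an F-score of 0.875 and the defective class 0.5. Because the clean class holds 80% of the instances, the weighted average is pulled towards the majority class, which should be kept in mind when reading weighted scores on imbalanced datasets.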

Experimental Results and Discussion
Table 2 summarizes the performance of the different classifiers based on classification accuracy. The RF classifier achieved accuracies of 0.91, 0.84, 0.90, 0.82, 0.97, and 0.83 for the PC1, PC3, PC4, KC2, MC1, and CM1 datasets, respectively. These were the highest accuracy scores among all classifiers for those six datasets, indicating that RF predicted defective instances better than the other classifiers on them. Moreover, the scores reported in Table 2 show that the bagging classifier with DS as a base learner performed well on the PC5, KC3, and MC2 datasets compared to the other classifiers.
In Figure 1, it is clear that the RF classifier obtained the highest accuracy scores for all datasets except PC5, JM1, KC3, and MC2. Furthermore, the maximum accuracy attained for PC1 was 0.91, whereas the minimum value was 0.78, obtained by LR for the same dataset. Among the base learners, RF was the best performing classifier for all datasets, while SVM was the worst classifier for all datasets except KC2, MC2, and CM1. In addition, bagging with DS achieved higher accuracy scores for PC3, PC5, KC3, MC2, and CM1 compared to the other bagging and boosting methods.
Table 3 reports the F-scores attained using the different classifiers. In general, it is apparent that the RF classifier was the best performing for six different datasets, as illustrated in Table 2 and Table 3. For PC1, PC3, PC4, KC2, MC1, and CM1, the RF classifier attained the highest F-scores compared to the other classifiers, indicating better predictions obtained by RF. In addition, the reported F-scores showed that the AdaBoost classifier with RF as a base learner attained scores similar to RF for the PC3, PC4, KC2, and MC1 datasets. Furthermore, bagging with DS achieved higher F-scores than the other classifiers for PC5, KC3, and MC2.
Figure 2 illustrates bar plots of the F-scores attained by the classifiers for all datasets. For the PC3, PC4, PC5, and JM1 datasets, it is obvious that the SVM, AdaBoost (SVM), and bagging (SVM) classifiers performed badly, as reported in Table 3. For JM1, the highest F-score was 0.77, attained by bagging (RF). Additionally, the lowest score was 0.71, which was attained by six different classifiers. Furthermore, the F-scores achieved by bagging (LR) were the minimum for KC3 and MC1. LR was the worst classifier for the MC2 and CM1 datasets.
The ROC-AUC scores achieved by all participating classifiers are shown in Table 4. For PC1 and PC3, the LR and bagging (LR) classifiers attained the highest ROC-AUC scores, achieving 0.77 for PC1 and 0.74 for PC3. Bagging with RF as the base estimator performed well in terms of ROC-AUC scores, reaching 0.84, 0.71, and 0.64 for the PC4, PC5, and JM1 datasets, respectively. The ROC-AUC score of AdaBoost with the LR classifier on the KC2 dataset was the best among all the classifiers at 0.78, while the lowest value was 0.66, attained by AdaBoost with DS. The SVM, AdaBoost (SVM), and bagging (SVM) classifiers achieved the highest ROC-AUC scores for the CM1 and MC1 datasets.
Figure 3 shows the bar plots of the ROC-AUC scores attained by all classifiers. It is clear that there is no dominant classifier, and this may be due to the nature of the datasets. For instance, the LR and bagging (LR) classifiers performed well on the PC1 and PC3 datasets, while these classifiers did not achieve the highest ROC-AUC scores for other datasets.
Our findings demonstrate that there was uncertainty in the classifiers' performance, as some classifiers performed well on specific datasets but worse on others. Similar to other studies [6] [22] [58], our results recommend using ensembles as predictive models to detect software defects. Additionally, the findings of those studies [6] [22] [58] agreed with our outcome that RF performed well. However, the experiments conducted by Hammouri et al. [61] indicated that the best performing algorithm was DS, while our study found that DS performed badly unless it was used as a base learner with bagging classifiers for some datasets, as reported in Table 2 and Table 3.

Threats to Validity
In this section, we list some potential threats in our study and responses to construct validity.
1) The selected datasets may not be representative. In our study, this threat is mitigated by evaluating the performance of the classifiers on ten well-known datasets that are commonly used in the literature.
2) The generalization of our results. We have attempted to mitigate this threat by measuring the performance of the base learners and the boosting and bagging classifiers on diverse datasets of different sizes.
3) The trained classifiers may overfit and bias the results. Instead of splitting the datasets randomly using a simple train-test split (70%-80% for training and 30%-20% for testing), we split each dataset into training and testing sets using 10-fold cross-validation to avoid the overfitting that random splitting might cause.

Related Works
Kalai Magal et al. [28] combined feature selection with RF to improve the accuracy of software defect prediction. Feature selection was based on correlation computation and aimed to choose the ideal subset of features. The features selected using correlation-based feature selection were then used with RF to predict software defects. Various experiments were conducted on open NASA datasets from the PROMISE repository. The outcome showed clear improvements obtained using the improved RF compared to the traditional RF. Venkata et al. [9] explored various machine learning algorithms for real-time system defect identification. They investigated the impact of attribute reduction on the performance of SDP models and attempted to combine PCA with different classification models, which did not show any improvements. However, the experimental results demonstrated that combining the correlation-based feature selection technique with a 1-rule classifier led to improvements in classification accuracy. Anuradha and Shafali [58] investigated three supervised classifiers: J48, NB, and RF. Various datasets were selected to assess the classifiers' efficiency at detecting defective modules. The conducted experiments demonstrated that the RF classifier outperformed the others. Moreover, Ge et al. [6] showed that RF performed well compared to LWL, C4.5, SVM, NB, and multilayer feed-forward neural networks. On the other hand, Singh and Chug [59] analyzed five classifiers (ANN, particle swarm optimization (PSO), DS, NB, and a linear classifier (LC)) and compared their performance in terms of detecting software defects. The experimental results showed that LC outperformed the other classifiers.
Aleem et al. [27] compared the performance of 11 machine learning methods and used 15 NASA datasets from the PROMISE repository. NB, MLP, SVM, AdaBoost, bagging, DS, RF, J48, KNN, RBF, and k-means were applied in their study. The results showed that bagging and SVM performed well on the majority of datasets. Meanwhile, Wang et al. [22] carried out a comparative analysis of ensemble classifiers for SDP and demonstrated that the voting ensemble and RF attained the highest classification accuracy compared to AdaBoost, NB, stacking, and bagging. Perreault et al. [19] compared NB, SVM, ANN, LR, and KNN on five NASA datasets. The outcomes of the conducted experiments did not show a superior classifier at identifying software defects. Hussain et al. [60] used the AdaboostM1, Vote, and StackingC ensemble classifiers with five base classifiers (NB, LR, J48, Voted-Perceptron, and SMO) in the Weka tool for SDP. The experimental results showed that StackingC performed well compared to the other classifiers.
Hammouri et al. [61] assessed NB, ANN, and DS for SDP. Three real debugging datasets were used in their study. Measurements such as accuracy, precision, recall, F-measure, and RMSE were utilized to analyze the results. The results of their study showed that DS performed well.
The above-mentioned approaches differ from the approach proposed in this paper in two ways. Firstly, we compared the performance of different supervised and ensemble methods on oversampled training data, while other works such as Kalai Magal et al. [28] and Venkata et al. [9] focused on the impact of feature selection and attribute reduction on the performance of classifiers. Secondly, a study very similar to ours was conducted by Alsawalqah et al. [63], who studied the impact of SMOTE on the AdaBoost ensemble method with J48 as a base classifier. Their findings demonstrated that SMOTE can help to boost the performance of the ensemble method on four NASA datasets. Our study differs in that we compared a variety of ensemble methods on the oversampled training data, while Alsawalqah et al. [63] used only AdaBoost with J48 as a base classifier.
The general finding in these related works is that classifiers such as RF, bagging, DS, and AdaBoost performed well in the SDP problem. Therefore, we have fo-

sidered basic software metrics. LOC metrics are typical measures of software development. Many studies in SDP have shown a clear correlation between LOC metrics and defect prediction [43] [44]. Among the software metrics most widely used for SDP are the cyclomatic complexity metrics, which were proposed by McCabe [45] and are used to represent the complexity of software products. McCabe's metrics (cyclomatic metrics) are computed based on the control flow graphs of a source code by counting the number of nodes, arcs, and

such as true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP denotes the number of defective software instances that are correctly classified as defective, while TN is the number of clean software instances that are correctly classified as clean. FP denotes the number of clean software instances that are wrongly classified as defective, and FN denotes the number of defective software instances that are mistakenly classified as clean.

One of the simplest metrics for evaluating the performance of predictive models is classification accuracy, also called the correct classification rate. It quantifies the proportion of correctly classified instances among all instances. Another measure is precision, which is calculated by dividing the number of instances correctly classified as defective (TP) by the total number of instances classified as defective (TP + FP) [16]. In addition, recall measures the ratio of the number of instances correctly classified as defective (TP) to the total number of faulty instances (TP + FN) [16]. F-score is the harmonic mean of precision and recall, and many studies in the literature have used F-score metrics [56] [57]. ROC-AUC calculates the area under the receiver operating characteristic (ROC) curve by computing trade-offs between TPR and FPR.
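As an illustration of how the area under the ROC curve follows from the TPR/FPR trade-off, the sketch below sweeps a decision threshold over predicted scores, records one (FPR, TPR) point per threshold, and integrates the resulting curve with the trapezoidal rule. The function names, labels, and scores are hypothetical; the sketch assumes binary labels (1 = defective, 0 = clean) with at least one instance of each class.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(y_true, scores):
    """Area under the ROC curve using the trapezoidal rule."""
    pts = roc_points(y_true, scores)
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


# Hypothetical predicted defect probabilities for six modules.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(y_true, scores)
print(f"ROC-AUC: {auc:.4f}")
```

In this example eight of the nine (defective, clean) pairs are ranked correctly, so the area is 8/9. An area of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.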
The performance of the classifiers was evaluated using 10-fold cross-validation, which splits each dataset into 10 consecutive folds; one fold is used for testing and the remaining folds for training. Afterwards, the features were standardized and scaled using the StandardScaler function in Python, which works by removing the mean and scaling the features to unit variance. Since the datasets were very imbalanced, oversampling using SMOTE was performed on the training data only, as it has been widely used in the literature to mitigate imbalance issues in training data for SDP.

Algorithm 1 was used for the experiments. It began by providing a list of datasets and a list of classifiers and then proceeded to iterate over all datasets, as shown in Line 8. The datasets were split into training and testing data based on 10-fold cross-validation with shuffling of the data before splitting, as shown in Line 9. Once the dataset was split, the StandardScaler function was used to standardize and scale the features. Once the features were standardized, the training data for each fold were re-sampled using the SMOTE technique, as shown in Line 11. The loop in Lines 12-25 aimed to train the classifiers, obtain predictions, and compute evaluation metrics. The average metrics were computed in Lines 27-31, as the datasets were split into 10 folds. The process in Lines 9-31 was repeated for all provided datasets.

Algorithm 1. The experimental procedure for software defect prediction.

A. Alsaeedi, M. Z. Khan. DOI: 10.4236/jsea.2019.125007. Journal of Software Engineering and Applications
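The procedure described above (shuffled 10-fold split, standardization, oversampling of the training folds only, training, and averaging the fold scores) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic, only RF is evaluated, the scaler is fitted on the training folds, and `smote_like_oversample` is a simplified, hypothetical stand-in for SMOTE written here for self-containment. In practice a library implementation, such as `SMOTE` from imbalanced-learn, would normally be used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def smote_like_oversample(X, y, minority=1, k=5, seed=0):
    """Simplified SMOTE-style oversampling (illustrative stand-in).

    New minority samples are interpolated between a minority sample and one
    of its k nearest minority neighbours until both classes are balanced.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))
    if n_new <= 0 or len(X_min) <= k:
        return X, y
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


def cross_validate(X, y, make_classifier, n_splits=10, seed=0):
    """Mean weighted F-score over shuffled k-fold cross-validation."""
    fold_scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Assumption: the scaler is fitted on the training folds only.
        scaler = StandardScaler().fit(X[train_idx])
        X_train = scaler.transform(X[train_idx])
        X_test = scaler.transform(X[test_idx])
        # Oversample the training data only; the test fold stays untouched.
        X_train, y_train = smote_like_oversample(X_train, y[train_idx], seed=seed)
        model = make_classifier().fit(X_train, y_train)
        predictions = model.predict(X_test)
        fold_scores.append(f1_score(y[test_idx], predictions, average="weighted"))
    return float(np.mean(fold_scores))


# Synthetic, imbalanced stand-in for a defect dataset (about 15% defective).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.85, 0.15], random_state=0
)
mean_f1 = cross_validate(
    X, y, lambda: RandomForestClassifier(n_estimators=50, random_state=0)
)
print(f"mean weighted F-score over 10 folds: {mean_f1:.2f}")
```

Oversampling inside the loop, after the split, keeps synthetic samples out of the test fold; oversampling before splitting would leak information from the test data into training.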

Figure 1. Classification accuracy scores of different classifiers.

Table 2. The accuracy scores obtained using different classifiers.

Table 3. The F-scores obtained using different classifiers.

Table 4. The ROC-AUC scores obtained using classifiers.