Fault Prediction with Static Software Metrics in Evolving Software: A Case Study in Apache Ant
1. Introduction
Testing is a crucial part of the software development life cycle [1]. Ultimately, the purpose of testing is to expose all faults in the software system. A solid testing strategy can provide a high level of confidence about the correctness of an application after it has been deployed. However, software testing can be resource-demanding [2]. Searching for faults at random may not be feasible [3], especially in large-scale projects. Practitioners (developers and testers) want to allocate resources in the most effective way to find faults.
Prior research [4] shows that a fault found after deployment can be 100 times as costly to fix as one found in an early stage. Researchers strive to help practitioners detect software faults as early as possible [5]. The decisions of when and where to focus testing efforts are often based on developers’ experience and expertise. This approach might not be reliable. It may not even be sustainable and consistent as developers move in and out of an organization [6]. The experience-based approach also varies widely, since practitioners have different perspectives on how to conduct testing.
With recent advancements in applying AI technologies to software engineering problems [7], many studies report promising preliminary results using machine learning techniques to predict faults in software systems [8] [9] [10]. This study explores which software metrics [11] are suitable for constructing fault prediction models and examines how well those machine learning models perform in predicting faults.
Unlike prior research [12] that depends on similar projects to build the prediction model, this study collects training data from different versions of the same project. Our approach yields a more reliable representation of the application for building fault prediction models. It is also more practical to collect training data as the subject project evolves than to search for similar projects in the wild.
This study aims to answer the following research questions when conducting the empirical study.
· What static software metrics can provide the best fault prediction results?
· Which machine learning models give the best fault prediction results?
· How well do prediction models perform across consecutive versions of the subject program?
We make the following contributions in this paper.
· An empirical study in fault prediction with software metrics.
· An evaluation of four different fault prediction models.
· A publicly accessible data set.
· Publicly accessible machine learning code (in MATLAB).
This paper is organized as follows. In Section 2, we present the overall approach of the empirical study. In Section 3, we describe the implementation, the subjects, and the design of the data sets. In Section 4, we answer the research questions and examine the study results. In Section 5, we discuss the sensitivity analysis and the threats to validity. Lastly, we conclude the empirical study in Section 6.
2. Approaches
In this section, we discuss the overall approach adopted by this empirical study. Figure 1 shows an overview of the approach. In the pre-processing phase, we extract and synthesize software metrics [8] [13] from the subject programs. In the model construction phase, we build fault prediction models and conduct sensitivity analysis to fine-tune the model hyperparameters.
2.1. Data Pre-Processing
We use static code analysis [14] to extract software metrics. Static software metrics are chosen over runtime software metrics for consistency. For instance, collecting runtime metrics may require instrumentation and monitoring tools, which can introduce high runtime overhead and disturb the execution of the subject program [15]. Also, depending on the deployment environment (e.g., physical or virtual machines running the subject program), we may get a completely different set of metric readings [16]. Table 1 lists the static software metrics used in the study.
Static software metrics undergo a series of pre-processing steps. First, we apply normalization [17] to bring metrics to the same scale while maintaining relative significance. For example, the value of the metric “Lack of Cohesion in Methods (LCOM)” [13] ranges from 0 to 2247 in dataset 3 before it is normalized to the range of 0 to 1. Normalization reduces the dramatic range in the metrics value space that may otherwise negatively affect the accuracy of the prediction models.
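The normalization step can be illustrated with a minimal Python sketch. It assumes min-max scaling, which the text does not name explicitly but which is consistent with mapping values onto the range 0 to 1. The sample LCOM readings are hypothetical; only the 0 to 2247 range comes from the text. This is not the authors' MATLAB code.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant metric has no spread to rescale; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical LCOM readings spanning the 0..2247 range mentioned in the text.
lcom = [0, 12, 310, 2247]
normalized = min_max_normalize(lcom)
```

The smallest reading maps to 0 and the largest to 1, while the relative ordering of the values in between is preserved.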
Not all static software metrics are suitable for constructing the fault prediction model. Some of them may even reduce the model’s accuracy. As the second pre-processing step, we therefore apply forward and backward feature selection [18] to reduce the dimensionality of the feature space and to achieve greater generalization.
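A greedy forward pass of feature selection might look like the Python sketch below. It covers only the forward direction, and the scoring function `toy_error`, its gain values, and the feature name `METRIC_X` are hypothetical stand-ins for a real cross-validated model error. It is not the authors' implementation.

```python
def forward_select(features, error_of):
    """Greedy forward selection: repeatedly add the feature that lowers error most."""
    selected = []
    best_error = float("inf")
    remaining = list(features)
    while remaining:
        # Score every subset obtained by adding one more candidate feature.
        scored = [(error_of(selected + [f]), f) for f in remaining]
        err, feat = min(scored)
        if err >= best_error:
            break  # no candidate improves the error, so stop
        selected.append(feat)
        remaining.remove(feat)
        best_error = err
    return selected, best_error


# Hypothetical error model: useful features lower the error, and every
# added feature costs a small penalty.
GAINS = {"WMC": 0.06, "LCOM": 0.03, "METRIC_X": 0.0}


def toy_error(subset):
    return 0.30 - sum(GAINS[f] for f in subset) + 0.01 * len(subset)


chosen, error = forward_select(GAINS, toy_error)
```

In this toy setup the uninformative feature is never added, because including it raises the error. A backward pass would work the same way in reverse, starting from all features and removing one at a time.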
2.2. Fault Prediction Models
In the second phase, we apply four machine learning techniques to build fault prediction models. Decision Tree (DT) [19] is a classic supervised learning model. The tree is constructed by recursive binary splitting, where each split is chosen to maximize the local information gain [13]. We use Gini’s Diversity Index [20] for tree pruning. Random Forest (RF) [21] is an ensemble method that combines a configurable number of decision trees. The number of decision trees used for each data set is based on a sensitivity analysis, which is discussed in Section 5. Support Vector Machine (SVM) [22] is a linear classification model that maximizes the margin around the decision boundary. The linear kernel is used for two-class learning:
$$G(x_j, x_k) = x_j^{\top} x_k$$
where x_j and x_k are two observations. An error-correcting output codes (ECOC) model is used for multi-class learning. K-nearest neighbor (KNN) [23] is an instance-based supervised learning method that assigns a data point to the majority class of its k nearest neighbors. KNN assumes that if two data points are similar, they are likely to be in the same class. We use the Euclidean distance to measure the distance between a data point and its neighbors. We conduct a sensitivity analysis to evaluate different k values and select the k value that gives the lowest misclassification error [24] to construct KNN. To avoid overfitting [25], ten-fold cross-validation [26] is applied to all four models. Since ten-fold cross-validation randomly samples instances and puts them in ten folds [27], the process is repeated ten times for each model to avoid sampling bias [28].
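The evaluation loop described above can be sketched in Python as a KNN classifier with Euclidean distance, scored by ten-fold cross-validation repeated ten times. The data points, the helper names, and the fixed random seed are assumptions made for illustration. The study itself uses the MATLAB Statistics Toolbox, whose internals may differ.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cv_error(data, k, folds=10, repeats=10, seed=0):
    """Mean misclassification error over repeated k-fold cross-validation."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)  # a new random fold assignment per repeat
        wrong = 0
        for i in range(folds):
            test = shuffled[i::folds]
            train = [row for j, row in enumerate(shuffled) if j % folds != i]
            wrong += sum(knn_predict(train, x, k) != y for x, y in test)
        errors.append(wrong / len(data))
    return sum(errors) / repeats


# Hypothetical, well-separated two-class data: (normalized metrics, label).
data = [((0.1 + 0.01 * i, 0.1), 0) for i in range(20)]
data += [((0.9 - 0.01 * i, 0.9), 1) for i in range(20)]

error_k3 = cv_error(data, k=3)
# Sensitivity sweep: keep the odd k with the lowest cross-validated error.
best_k = min(range(1, 8, 2), key=lambda k: cv_error(data, k))
```

Repeating the cross-validation with fresh shuffles averages out the effect of any single fold assignment, which is the sampling-bias concern raised in the text.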
3. Empirical Study
In this section, we discuss details of the implementation, subjects, and data set design.
3.1. Implementation
The experiment runs on a Mac OS X machine with a quad-core 2.4 GHz Intel Core-i5 CPU, 16 GB of memory, and a 256 GB SSD. We use CodeAnalyzer [29] to extract static software metrics. CodeAnalyzer is a lightweight tool for analyzing source code. To build fault prediction models, we use the MATLAB Statistics Toolbox [30].
3.2. Subjects and Data Sets
Apache Ant is an open-source Java-based build tool. Four consecutive versions (v1.4 to v1.7) of Apache Ant are used for their popularity [31] and availability [32]. Table 2 shows the characteristics of Apache Ant. The first column (METRIC) shows the size of Apache Ant. The third column (RATIO) shows the proportion of source code. We refer to the online repository Predictor Models in Software Engineering (PROMISE) [32] for Apache Ant fault data. Figure 2 shows the distribution of faults in the four versions of Apache Ant. Color schemes are used in the bar chart to indicate different numbers of faults in a class. For example, in the ANT-V4 data set, 78% (565) of the classes have zero faults and 12.5% (91) of the classes have one fault.
To prepare the raw training data set, we associate the software metrics (features) of the training data with the faults (labels) provided in PROMISE by the module’s class name. With each modeling iteration, the training data set is expanded and the fault prediction models are rebuilt using the techniques outlined in Section 2.2.
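The association of features with labels by class name amounts to a keyed join, sketched below in Python. The metric values and fault counts are hypothetical, and the column names `module` and `bug` merely mirror the “Module” and “Bug” columns mentioned for Table 3. The helper `join_by_class` is an illustrative name, not part of the authors' tooling.

```python
def join_by_class(metrics, faults):
    """Attach a fault count to each metrics row, matching on class name."""
    rows = []
    for name, features in metrics.items():
        if name in faults:  # skip classes that have no label available
            rows.append({"module": name, **features, "bug": faults[name]})
    return rows


# Hypothetical extracted metrics, keyed by fully qualified class name.
metrics = {
    "org.apache.tools.ant.Project": {"wmc": 0.42, "lcom": 0.10},
    "org.apache.tools.ant.Task": {"wmc": 0.08, "lcom": 0.01},
    "org.example.Unlabeled": {"wmc": 0.05, "lcom": 0.00},
}
# Hypothetical fault labels, keyed the same way.
faults = {"org.apache.tools.ant.Project": 2, "org.apache.tools.ant.Task": 0}

training_rows = join_by_class(metrics, faults)
```

Classes without a matching label are dropped here. Whether the study drops or otherwise handles such classes is not stated in the text.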
4. Results and Discussion
In this section, we answer the following research questions and discuss the study results.
· RQ1: What static software metrics can provide the best fault prediction results?
· RQ2: Which machine learning models give the best fault prediction results?
· RQ3: How well do prediction models perform across consecutive versions of Apache Ant?
Table 2. Apache Ant characteristics.
Figure 2. Apache Ant faults distribution.
4.1. RQ1: Software Metrics
We build models with three sets of metrics (Table 1). For each model, the default, complex, and combined metrics are used as the training data, respectively. Table 3 shows a portion of the training data set for ANT-V1. For example, the “Module” column shows the class name; the “Weighted Methods per Class (WMC)” column is the sum of the complexities of all class methods; the “Bug” column shows the number of faults in the class. Figure 3 shows the performance of fault prediction models with all three sets of metrics. Their performance varies among different data sets. For example, with DT the complex metrics outperform the default metrics in ANT-V1 but fall short of them in ANT-V2. On average, models built with combined metrics have the lowest misclassification error (0.2).
Figure 3. Software Metrics Performance. (a) ANT-V1; (b) ANT-V2; (c) ANT-V3; (d) ANT-V4.
4.2. RQ2: Fault Prediction Model Performance
To answer RQ2, we compare the performance of models built with individual Apache Ant versions in Figure 4. The performance of fault prediction models varies across Apache Ant versions. For example, RF has a misclassification error of 0.09 in ANT-V2 compared to a misclassification error of 0.254 in ANT-V3. Overall, SVM has the lowest misclassification error (0.148), followed by RF (0.192), KNN (0.203), and DT (0.259). Figure 4 shows that models trained with ANT-V2 have the best performance, with an average misclassification error of 0.103 compared to ANT-V1 (0.227), ANT-V3 (0.248), and ANT-V4 (0.225). We observe that for linearly separable spaces, KNN is preferred for its interpretability, although it requires a larger data set to work accurately.
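As a quick cross-check of the figures above, the following Python sketch averages the reported errors by model and by version. It assumes unweighted means, which the text does not state explicitly. Both averages come out close to the overall value of 0.2 quoted elsewhere in the paper.

```python
# Misclassification errors as reported in this section.
model_errors = {"SVM": 0.148, "RF": 0.192, "KNN": 0.203, "DT": 0.259}
version_errors = {
    "ANT-V1": 0.227,
    "ANT-V2": 0.103,
    "ANT-V3": 0.248,
    "ANT-V4": 0.225,
}


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


mean_by_model = mean(model_errors.values())      # 0.802 / 4, about 0.2005
mean_by_version = mean(version_errors.values())  # 0.803 / 4, about 0.2008
best_model = min(model_errors, key=model_errors.get)
best_version = min(version_errors, key=version_errors.get)
```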
4.3. RQ3: Cross Program Training and Fault Prediction
To answer RQ3, we examine whether training data from other project versions can improve fault prediction performance. To prepare the expanded data set, we construct a new data set from all previous training data sets. For example, ANT-DS-2 contains the data of ANT-DS-1 plus ANT-V2, and ANT-DS-3 contains the data of ANT-DS-2 plus ANT-V3. Figure 5 illustrates the performance of each fault prediction model with the expanded data set. Overall, the misclassification error with the expanded data set is comparable to that with the regular data set (MC_expanded = 0.206 vs. MC_regular = 0.2). Models built with the expanded data set outperform those built with the regular data set in ANT-DS-3 (MC_expanded = 0.198 vs. MC_regular = 0.248) and ANT-DS-4 (MC_expanded = 0.209 vs. MC_regular = 0.225), and underperform them in ANT-DS-2 (MC_expanded = 0.214 vs. MC_regular = 0.103). The results imply that when training data for a subject is unavailable, we may utilize training data of a different version of the same subject.
Figure 4. Fault prediction models performance.
Figure 5. Fault prediction with expanded data set.
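The construction of the expanded data sets can be sketched as a running concatenation of the per-version data. The sketch assumes that ANT-DS-1 is simply ANT-V1, which the text implies but does not state. The row contents are placeholders and `build_expanded` is an illustrative name.

```python
def build_expanded(versions):
    """Return cumulative data sets: ANT-DS-n holds all rows from V1 through Vn."""
    expanded = {}
    accumulated = []
    for n, rows in enumerate(versions, start=1):
        accumulated = accumulated + rows  # new list, earlier sets stay unchanged
        expanded[f"ANT-DS-{n}"] = accumulated
    return expanded


# Placeholder rows standing in for the per-version training data.
ant_versions = [
    ["v1-row-a", "v1-row-b"],
    ["v2-row-a"],
    ["v3-row-a", "v3-row-b"],
    ["v4-row-a"],
]
expanded_sets = build_expanded(ant_versions)
```

Each expanded set is a superset of the previous one, so the models for later iterations are trained on strictly more data.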
5. Discussions
In this section, we discuss the sensitivity analysis and the threats to validity of the empirical study.
5.1. Sensitivity Analysis
One challenge of using machine learning techniques is finding proper values for the hyperparameters. To get better fault prediction results, we try out different values to fine-tune the models. For example, Figure 6 shows the influence of the number of trees used in RF. For KNN, different numbers of neighbors (Figure 7) were evaluated to minimize the classification error. Empirical data indicate that, for Apache Ant, the best number of neighbors falls between 13 and 16.
5.2. Internal Validity
A threat to internal validity for this study is the possibility of faults in the implementation of our approach and in the tools that we use to perform the evaluation. We control this threat by extensively testing our tools and verifying their results against a small program for which we can manually determine the correctness of the results. Another threat involves the selection of hyperparameters [33] used in machine learning techniques. We use the recommended settings for each modeling technique and conduct a sensitivity analysis to fine-tune the parameters. The accuracy of each fault prediction model may also differ between implementations. For example, RF may report a different misclassification rate in scikit-learn [34] and Weka [8] [35]. We choose the Statistics and Machine Learning Toolbox in MATLAB for its ease of use and its popularity (MATLAB has been widely used in both industry and academia).
Figure 6. Number of RF Trees. (a) ANT-V1; (b) ANT-V2; (c) ANT-V3; (d) ANT-V4.
5.3. External Validity
The primary threat to external validity for this study involves the representativeness of the selected subjects and modeling techniques. Other subjects may exhibit different characteristics and lead to other conclusions [36]. We reduce this threat by studying multiple versions of the subject program. In addition, we apply four different modeling techniques on seven data sets to generalize conclusions.
5.4. Construct Validity
The primary threat to construct validity involves the data sets and software metrics used in the study. To mitigate this threat, we use data sets that are publicly available, well understood, and widely used [32]. We have also applied well-known software metrics that are straightforward to compute and less error-prone.
Figure 7. Number of Neighbors. (a) ANT-V1; (b) ANT-V2; (c) ANT-V3; (d) ANT-V4.
5.5. Limitations
The first limitation of this work is that our approach requires the source code to obtain the training data. In some cases, especially for legacy programs, the source code may not be available [2]. Second, the preparation of the training data is not fully automated. Our approach first extracts static metrics from the source code, and then we manually combine them with the PROMISE labels (faults) to get the training data set. One solution is to automate the fault prediction model construction as part of continuous integration (CI) [37]. We can leverage the fault information from the issue tracker to automatically append the labels to the training data set.
6. Conclusion
We conduct an empirical study to examine the effectiveness of building fault prediction models with static software metrics. We examine which metrics are effective for building such models, study four different types of fault prediction models with four consecutive versions of Apache Ant, and evaluate model performance across those versions. Our results suggest that the fault prediction models built with combined software metrics have the lowest overall misclassification error (0.2). Among all fault prediction models, SVM has the lowest misclassification error (0.148). Lastly, our results show that the fault prediction models built with the expanded data set perform comparably to those built with the regular data set. When training data for a subject is unavailable, we may utilize training data of a different version of the same subject.