Can We Predict the Change in Code in a Software Product Line Project?

Y. A. Alshehri
Journal of Software Engineering and Applications


Introduction
Predicting change in a software product line (SPL) is vital in the development cycle. In the development cycle of the project studied here, it takes one year for a new release to be deployed. Before deployment, the new release goes through a testing period in which issues and bugs are reported. During testing, it is imperative to predict which files are likely to face change (any change). These units should be given more attention, which helps the team plan well and reduce cost by correctly estimating the resources to be allocated. The predicted change of a software unit can be minor or significant. Our goal in this research is to detect the change regardless of its size. To achieve that, we use machine learning models that learn from the files of the current release and predict the change in the next release. We use static code metrics as predictors. Static code metrics describe the main features of a software unit, such as its complexity, number of methods, and the cohesion of its classes. Change metrics are not used to predict change in the next release because, although they may help to characterize the nature of a change, they cannot necessarily predict whether a change will occur.
The way we observe the change to a software unit (i.e., file or class) is based on the change in its lines of code. If the unit encountered added or deleted lines of code, it is change prone (i.e., classified as changed). If it did not face any added or deleted lines, it is classified as not changed. We can determine whether a file experienced change by observing the code churn metric, which is the total number of added and deleted lines; when this metric is zero, the file has never been changed. In this paper, we measure the ability of our models to predict the next release by learning from the current release. We test the performance of different algorithms to explore how consistent their results are with each other. Lastly, we explore whether the performance differs across releases and whether it is affected by the evolution of the project. The research questions we address in this work are the following:
-RQ1: Can we predict the change in a software product line project?
-RQ2: What releases of the Eclipse project provide good learning to algorithms? Does the size of the dataset improve the training?
-RQ3: Does predicting change improve as the product evolves?
-RQ4: Does any of the machine learning algorithms perform better than the others?
The rest of the paper is organized as follows: Related work is discussed next in Section 2. Then, we explain the data mining approach of this work in Section 3, including machine learning algorithms, metrics, datasets, and performance metrics. We discuss the results in Section 4. Threats to validity are explained in Section 5, and the paper is concluded in Section 6.

Related Works
There are several features related to change in code. Change can be represented through the number of lines added to or deleted from a file, the number of authors who contributed to the file, the number of revisions, or the number of refactorings.
These features have been successfully used [1]-[8] to predict software fault proneness. In this research, we are interested in using one of the change metrics to predict any change associated with software files. The metric we use for this purpose is the code churn metric, which represents the total number of lines added to and deleted from a software class. This metric was used to predict software faults in [2] [6]. In this study, static code metrics are used as input metrics to predict the change. Static code metrics were also used, along with change metrics, to predict software faults. In this section, we highlight related works that targeted the study of software change proneness and used the change metric as a response variable but not as a predictor. Some studies performed statistical analyses to investigate the relationship between different classes and bad smells [9] [10] [11]. Other studies applied prediction models to predict the change in software. Abdi et al. [12] used several machine learning algorithms (e.g., J48, JRip, PART, and NBTree) to predict the change in open source projects. Tsantalis et al. [13] predicted the likelihood of change on software when functions are added to classes, using logistic regression and measuring the performance with accuracy, sensitivity, false-positive ratio, and false-negative ratio. A genetic programming algorithm with object-oriented metrics was used to predict the change in [14]. Object-oriented metrics were also used with 19 projects, including Eclipse, in [15]. Code smell related information was used to improve change prediction in [16].
This work aims to investigate how prediction models perform on a software product line project. Eclipse is the chosen project for this work, as we have access to seven consecutive releases (Eclipse 2.0, 2.1, 3.0, Europa, Ganymede, Galileo, and Helios). There are some other releases between Eclipse 3.0 and Europa that we did not have access to in this work.
In this work, we explore the ability of models to learn from one release and predict change on the following release. This approach should help to identify files that are likely to experience change in the next release. Predicting these files can help improve code quality ahead of time by identifying files that are likely to change and learning why they need to be changed.
Also, we use widely known algorithms to conduct these experiments. This is particularly important for exploring the generalizability of the learning and testing performance of these algorithms on these datasets, and for identifying any challenges we may face when using them or whether one algorithm performs better than the others.
Lastly, we explore if the performance improved due to the evolution of the project. In this sense, we need to see how the performance differs from the old releases to recent ones.

Methodology
This section discusses the data mining methodology applied in this research.
This includes the type of learners used for prediction, datasets and sampling process, metrics definitions, and performance metrics used to evaluate learners.
The model is trained on a release (release n) and tested on the following release (release n + 1). This means that we have a total of six tested models on six releases. We cannot test on the first release because this would require access to its predecessor, release n − 1, which we do not have. This process is iterated four times, as we use four algorithms to train our models. The total number of experiments in this work is twenty-four, the product of six releases and four algorithms. The outcome of each experiment is four performance measures (i.e., accuracy, recall, precision, and F-score).
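The train-on-release-n, test-on-release-n+1 protocol can be sketched as follows. The release-loading helper and the learner settings are hypothetical placeholders, not the paper's actual tooling; scikit-learn has no J48, so an entropy-based decision tree stands in for it.

```python
# Sketch of the experimental protocol: for each consecutive release pair,
# train each of the four learners on release n and test on release n + 1,
# collecting the four performance measures (6 pairs x 4 learners = 24 runs).
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

RELEASES = ["2.0", "2.1", "3.0", "Europa", "Ganymede", "Galileo", "Helios"]

LEARNERS = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "J48": DecisionTreeClassifier(criterion="entropy"),  # J48 stand-in
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

def run_experiments(load_release):
    """load_release(name) -> (X, y): static code metrics and changed/unchanged labels."""
    results = []
    for train_rel, test_rel in zip(RELEASES, RELEASES[1:]):  # six consecutive pairs
        X_train, y_train = load_release(train_rel)
        X_test, y_test = load_release(test_rel)
        for name, clf in LEARNERS.items():
            pred = clf.fit(X_train, y_train).predict(X_test)
            results.append({
                "train": train_rel, "test": test_rel, "learner": name,
                "accuracy": accuracy_score(y_test, pred),
                "recall": recall_score(y_test, pred),
                "precision": precision_score(y_test, pred),
                "f_score": f1_score(y_test, pred),
            })
    return results
```

The `load_release` function is assumed to return the static code metrics as a feature matrix and the binary change labels for one release.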

Learners
Several learners (e.g., logistic regression, decision tree, and Naive Bayes) have been used in the software fault proneness area [17]. Many of the top learners provided performances that are not significantly different from each other [18].
Our selection of algorithms is based on three main factors: the popularity of the algorithms, their fit to the data, and their ease of implementation. In this section, we briefly explain the learners used in this study. Logistic regression (LR) models describe the probability of the existence of a condition (i.e., fault-prone or fault-free) based on a given set of variables X_i. The set of variables is combined in a linear function and then placed into the logit model to calculate a probability ranging between 0 and 1, as shown in Equation (1):

P(Y = 1 | X) = exp(β_0 + Σ_i β_i X_i) / (1 + exp(β_0 + Σ_i β_i X_i))  (1)
where Y is the response variable (fault prone, fault free), X_i are the independent variables (i.e., metrics), and β_i are the regression coefficients.
Naive Bayes classification works based on Bayes' rule, as defined by Equation (2). The classifier is famous for its simplicity and fast computation. The classifier treats the set of input metrics (numerical or categorical) as if they were independent of each other. The probability of the response variable is calculated as shown in Equation (2):

P(Y = C_k | X_1, ..., X_n) ∝ P(C_k) Π_i P(X_i | C_k)  (2)
where Y is the response variable (i.e., change prone, non-change prone), X_i is the independent variable, k indexes the classes (two in our case), and n is the number of input metrics.
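As an illustration of the rule in Equation (2), a minimal categorical naive Bayes classifier could look like the following sketch. The Laplace smoothing it applies is a common addition, not something stated in the paper.

```python
# Toy categorical naive Bayes, illustrating the independence assumption of
# Equation (2): score(c) = P(Y = c) * prod_j P(X_j | Y = c).
# Laplace smoothing (an added assumption) avoids zero probabilities.
from collections import Counter

def naive_bayes_predict(train_X, train_y, x):
    """train_X: feature tuples, train_y: class labels, x: query tuple."""
    class_counts = Counter(train_y)
    n = len(train_y)
    scores = {}
    for c, count in class_counts.items():
        score = count / n                                # prior P(Y = c)
        rows = [xi for xi, yi in zip(train_X, train_y) if yi == c]
        for j, value in enumerate(x):                    # per-metric likelihoods
            matches = sum(1 for r in rows if r[j] == value)
            score *= (matches + 1) / (count + 2)         # smoothed P(X_j | Y = c)
        scores[c] = score
    return max(scores, key=scores.get)
```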
Decision tree J48 works by splitting the data on the most significant splitter (i.e., metric). The splitter is chosen based on the impurity, or uncertainty, of the data in the current subset. The decision to split is based on calculating the information gain, as shown in Equations (3) and (4). The information gain subtracts the conditional entropy of the data given the selected metric X_i from the prior entropy. The classifier continues splitting the data until a tree is formed, starting from the root (i.e., all metrics) and ending with leaves, or terminal nodes (i.e., metrics that were not split).
H[D] = −Σ_{c ∈ C} p(c) log2 p(c)  (3)

IG(D, X_i) = H[D] − Σ_v (|D_v| / |D|) H[D_v]  (4)

where C is the set of desired classes, H[D] is the entropy of dataset D, and D_v is the subset of D for which metric X_i takes value v. Random forest is an ensemble tree-based learning algorithm developed by [19].
Other ensemble classifiers inspired the algorithm (e.g., bagging, random split selection). The algorithm creates multiple trees and takes a majority voting on the predicted class instead of on a single tree decision [20].
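For concreteness, the entropy and information-gain quantities of Equations (3) and (4), which drive the J48 splits described above, can be computed as in this sketch:

```python
# Entropy (Equation (3)) and information gain (Equation (4)): the gain from
# splitting on a metric is the prior entropy of the labels minus the weighted
# entropy of the subsets induced by the metric's values.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, metric_values):
    n = len(labels)
    subsets = {}
    for y, v in zip(labels, metric_values):      # group labels by metric value
        subsets.setdefault(v, []).append(y)
    conditional = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional
```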

Datasets
In this area, different datasets have been used [17]. Eclipse is one of the software projects used by 50% of the studies reported in [17]. In this study, we use seven releases of the Eclipse project (i.e., Eclipse 2.0, 2.1, 3.0, Europa, Ganymede, Galileo, and Helios). The size of each release is shown in Table 1.

Metrics
Static code metrics are associated with the change in software [21]. Therefore, we used only static code metrics as predictors. Earlier works also used static code metrics (e.g., [22] [23] [24] [25]), and Hall et al. [17] found that static code metrics were used by 38% of studies. The static code metrics used in this work were extracted in the earlier work [26], and the change metrics were extracted in [27]. Out of all the change metrics, we used the code churn metric, in binary format (i.e., changed/not changed). Changed files are all files that experienced added or deleted lines. Unchanged files are files that had not experienced any change at all.
All static metrics are defined in Table 2.
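The binarization of code churn into the changed/not-changed response can be sketched as follows; the column names are illustrative, not the dataset's real schema.

```python
import pandas as pd

# Hypothetical per-file churn counts (illustrative values only).
files = pd.DataFrame({
    "file": ["A.java", "B.java", "C.java"],
    "added": [10, 0, 3],
    "deleted": [2, 0, 0],
})
files["churn"] = files["added"] + files["deleted"]    # total added + deleted lines
files["changed"] = (files["churn"] > 0).astype(int)   # 1 = change prone, 0 = unchanged
```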

Performance Metrics
Performance metrics are used to measure the capabilities of all learners in predicting the classes (change prone or not). In this study, we used four major performance measures: accuracy, recall, precision, and F-score. All these measures are derived from the confusion matrix (see Table 3). Accuracy measures the number of correct classifications over the total number of instances (Equation (5)). Recall measures the rate of correct classifications over the number of instances that are actually change prone, which is the total of the true positives and the false negatives, as in Equation (6).
Precision measures the correctly classified instances over the number of instances that are predicted as change prone, which is the total of the true positives and the false positives, as in Equation (7). F-score (see Equation (8)) is the harmonic mean of recall and precision:

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (5)

Recall = TP / (TP + FN)  (6)

Precision = TP / (TP + FP)  (7)

F-score = 2 × Precision × Recall / (Precision + Recall)  (8)
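These four measures can be computed directly from the confusion-matrix counts, as in this quick sketch:

```python
# Equations (5)-(8) computed from confusion-matrix counts (TP, TN, FP, FN).
def performance(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # correct positives over actual positives
    precision = tp / (tp + fp)       # correct positives over predicted positives
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_score
```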

Results and Discussion
The results discussed in this section check the performance of the prediction models we developed to predict files that are likely to experience change. The results help to determine whether the models' performance is good enough to detect change across all tested releases. They also identify whether one algorithm works significantly better than the others or all algorithms perform without differences. Lastly, we explore whether the performance is affected by the evolution of the software project; in other words, whether the performance on the last release is significantly higher than on previous releases.
The results shown in this section are for Eclipse 2.1, 3.0, Europa, Ganymede, Galileo, and Helios. We trained the prediction models with four algorithms using static code metrics on one release and tested on the next; therefore, Eclipse 2.0 is used only for training. All recalls fall between 70% and 100%, and logistic regression provided the highest recall of all the algorithms. Decision tree J48 came second and random forest third. NB provided notably low recalls when trained on small datasets, as seen in the first three releases.
With respect to precision, the results of all algorithms on all releases are presented in Figure 3. All algorithms scored precisions between 72% and 100%.
Eclipse 2.1, Europa, Ganymede, and Helios reported the highest precisions. High precision means that fewer events are reported as false positives.
-RQ1: Can we predict the change in a software product line project?
Predicting changed or unchanged files requires a balanced distribution of the number of changed and unchanged files. In this study, we managed to predict changed files in Eclipse 2.1 and 3.0, and we predicted unchanged files in Europa, Ganymede, Galileo, and Helios. When changed or unchanged files are rare events, predicting them will be unsuccessful due to poor classification. To overcome this problem, we need to apply an oversampling method to obtain a balanced distribution.
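One way to obtain such a balanced distribution is random oversampling of the minority class, sketched here with scikit-learn's `resample` utility. The paper does not specify a particular oversampling technique, so this is only one plausible choice.

```python
# Random oversampling: replicate minority-class rows (with replacement) until
# each class matches the majority-class count.
import numpy as np
from sklearn.utils import resample

def oversample(X, y, random_state=0):
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    target = counts.max()
    X_parts, y_parts = [X[y == majority]], [y[y == majority]]
    for c in classes:
        if c == majority:
            continue
        Xc, yc = resample(X[y == c], y[y == c], replace=True,
                          n_samples=target, random_state=random_state)
        X_parts.append(Xc)
        y_parts.append(yc)
    return np.concatenate(X_parts), np.concatenate(y_parts)
```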
-RQ2: What releases of the Eclipse project provide good learning to algorithms? Does the size of the dataset improve the training?
We found that all datasets provide similar learning, because the performance across all tested releases is almost at the same level. Only one algorithm (i.e., Naive Bayes) showed a different pattern: it works well when learning from large datasets (e.g., Europa, Ganymede, Galileo, and Helios) but provided low accuracy, recall, precision, and F-score when trained on small datasets (e.g., Eclipse 2.0, Eclipse 2.1, Eclipse 3.0).
-RQ3: Does predicting change improve as the product evolves?
When we used the naive Bayes algorithm, the performance increased steadily, starting at 35% accuracy on the first release and reaching 90% accuracy on the last release. The same pattern exists for the recall and F-score. The reason could be the sensitivity of the NB algorithm to dataset size rather than the evolution of the project.
-RQ4: Does any of the machine learning algorithms perform better than others?
In terms of accuracy, logistic regression performed better than other algorithms on three releases but without a significant difference.

Threats to Validity
This research took all steps to ensure that no threat affects the internal, construct, conclusion, and external validity.
Internal validity is concerned with the quality of the data. Our confidence in the data is high, as we conducted sanity checks to ensure their quality and that they reflect the actual source files.
Construct validity concerns whether the experiment measured what it intended to measure. We explained what we intended to measure in the introduction through the research questions. We developed the experiments on this basis, gathered all results, explained them, and addressed all research questions clearly at the end of the work. We predicted changed files in Eclipse 2.0, 2.1, and 3.0. In the other releases, we reported the performance of the models when predicting unchanged files, because they were the majority class.
When predicting unchanged files, we decided that these groups of files will not require change. To ensure conclusion validity, we applied algorithms that are common in the area and verified that they fit the data we used. Our response variable is dichotomous, and the input metrics are in numerical and dichotomous format. The models were evaluated using very common measures, which help to address all the research questions mentioned in the introduction.
External validity can be violated if we claim the generalizability of the results. Our results are valid for the specific releases used from the Eclipse project. We do not generalize the results on other software projects.

Conclusions
In this work, we predicted the change in software files in one of the software product line projects (i.e., Eclipse). We used four algorithms, trained on six releases, and tested on six releases, where the training release is the release right before the tested release. We found that predicting changed and unchanged files is possible for all releases. The only problem that could face a software manager is obtaining a balanced distribution of the two classes of the response variable. We found that all algorithms perform at the same level, except for the naive Bayes algorithm when trained on small datasets. Lastly, we found that there is not enough evidence to show that the evolution of the project improves learning.
Our future work will consider predicting the level and type of change that software files are likely to face at every release of Eclipse. We also need to consider methods to improve performance (e.g., parameter tuning). We will replicate the work on other software projects to explore generalizability. Further, we will carry out explanatory work to quantify the contribution of the explanatory metrics to the response variable.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.