Software Effort Prediction Using Ensemble Learning Methods

Software Cost Estimation (SCE) is an essential part of producing software today. Accurate estimation requires weighing the cost and effort factors involved in delivering software, using either algorithmic models or Ensemble Learning Methods (ELMs). Effort is estimated in terms of person-months and duration. Both overestimation and underestimation of effort can adversely affect software development. Hence, it is the responsibility of software development managers to estimate cost using the best possible techniques. The predominant cost of any software product is the cost of effort. Consequently, effort estimation is pivotal, and there is a constant need to improve its accuracy. Several effort estimation models are available; however, it is difficult to determine which model is more accurate on which dataset. Hence, we use the bagging ensemble learning method with the base learners Linear Regression, SMOReg, MLP, Random Forest, REPTree, and M5Rule. We also apply two feature selection algorithms, BestFit and Genetic Algorithm, to examine the effect of feature selection. The dataset, known as China, comprises 499 projects. The results show that the Mean Magnitude of Relative Error of Bagging M5Rule with Genetic Algorithm feature selection is 10%, which makes it better than the other algorithms.


Introduction
For software developers, the quality of a software product is vital, and software cost estimation helps developers maintain good quality. Estimating software cost in terms of person-months and the time needed to complete the project is crucial. Although software cost estimation plays a vital role in software development, there have been only minor advances in this area in the last few decades. The most important reason for the failure of a project is poor cost estimation. Even though many effort models are available, novel methods for improving estimation accuracy are still needed. The motivation for developing a software effort prediction model is therefore to estimate software effort as accurately as possible. Software cost estimation models are used to forecast the cost of software. Machine learning methods use historical datasets to predict the cost of future software. The fundamental purpose of using machine learning is to learn the inherent patterns in feature values and their relationship to project effort, and then to predict the effort for new software projects.
ML approaches have been used as a complement to both expert judgment and algorithmic models over the past decade. These approaches include Artificial Neural Networks (ANN), fuzzy logic, bagging, boosting, decision trees, Support Vector Machines (SVM) and so on. Their advantage is that they can model the complex relationship between effort and the independent variables. They are used for difficult problems where the outcome must be learned from historical data. Many machine learning approaches are reported in the literature, though it is very difficult to say which approach is better.
Software effort estimation plays a vital role in calculating the cost of developing a software project. Understanding and controlling the basic factors that influence software cost is a fundamental task in software project management. Software metrics are measures of the properties and qualities of a software product. Since software measurement is fundamental to software engineering, there have been numerous studies over the last four decades that aim to give a thorough view of the complex nature of software and to use it in software cost estimation and software analysis. Although the first book on software metrics was published in 1976 [1], the history of software metrics research dates to the 1960s, when the lines of code (LOC) metric was used to quantify developer productivity and software complexity and quality. LOC was used as a principal input for effort prediction in some models, for example [2] [3].
In the mid-1970s, interest in software design complexity grew when graph-theoretic complexity was discussed by McCabe in [4]. He developed a mathematical technique for program modularization. Several definitions from graph theory were used to measure and control the number of paths through a software program, a measure known as the Cyclomatic Complexity metric. From then on, this metric was used for complexity estimation in place of size metrics. In 1984, Basili and Perricone [5] found a relationship between McCabe's Cyclomatic Complexity and module size. They found that large modules have high complexity.
Fei, Zhi and Chao [6] proposed an enhancement to the Halstead complexity metrics by adding weights: different operators and operands were given different weights. Six object-oriented design metrics were created and evaluated by Chidamber and Kemerer in 1994 [7]. These are known as the CK metrics: weighted methods per class (WMC), depth of inheritance tree (DIT), number of children (NOC), coupling between object classes (CBO), response for a class (RFC), and lack of cohesion in methods (LCOM). Smith, Hale and Parish [8] considered four task assignment factors (intensity, concurrency, fragmentation and team size) for their effect on software development effort. Each of these factors improved the estimates of the intermediate COCOMO I model. These factors, together with unadjusted function points, help in developing a better effort estimation model with improved predictive capacity compared to the COCOMO model.
Tosun, Turhan and Bener [9] proposed a novel approach for improving estimation accuracy using a feature weight assignment algorithm, which gives better results than previous research. A statistical technique called Principal Component Analysis (PCA) was used to implement the two weight assignment heuristics. Pahariya, Ravi, Carr and Vasu [10] proposed new computational intelligence sequential hybrid architectures involving programming and the Group Method of Data Handling (GMDH). These incorporate data mining techniques such as Multiple Linear Regression (MLR), Radial Basis Function (RBF), etc. [10]. Other studies of ANN models for predicting SCE are [10]-[17]. Andreou and Papatheocharous [18] used Fuzzy Decision Trees (FDTs) to predict required effort and code size in cost estimation, reporting evidence that fuzzy transformations of cost drivers contributed to improving the prediction process. More research on this topic can be found in [19]-[25]. Reddy and Raju [26] improved the fuzzy approach to COCOMO software effort estimation by using the Gaussian membership function, which performs better than the trapezoidal function for representing cost drivers.
In this paper, we describe COCOMO models, then explain the roles and application of machine learning techniques such as Linear Regression, Support Vector Machine (SMOReg), Neural Networks, M5Rule, REPTree and Random Forest. We then apply ensemble learning based on these learners. Finally, we compare the outcomes of these methods in terms of actual and estimated effort.
As we have seen, software repositories or datasets are generally used to obtain the data on which effort estimation is performed. Yet software repositories contain data from heterogeneous projects. The conventional use of regression equations to derive a single mathematical model results in poor performance [27]. Gallogo [27] used data clustering to tackle this issue. In this study, the models are built and validated using statistics and ensemble learning methods, and the results are compared with previous research. The results show that Bagging M5Rule with Genetic Algorithm feature selection achieves an MMRE of 10%. We try to answer the following research questions.
RQ1. What is the impact of the BestFit and Genetic Algorithm feature selection methods on software effort prediction on two datasets when performance is measured using four metrics: MMRE (mean magnitude of relative error) and prediction at levels 0.25, 0.50 and 0.75?
RQ2. What is the performance of ensemble learning techniques? We determine which ML techniques give the best and worst results for each dataset explored in the study.

Dataset
In this study we use the China dataset for software cost estimation from the Promise data repository [28]. The China dataset comprises 19 features: 18 independent variables and 1 dependent variable. It has 499 instances corresponding to 499 projects. The descriptive statistics of the China dataset are given in the appendix in Table 1. The set of independent variables determines the value of the dependent variable, which in this study is effort. Some of the independent variables may be removed, as they may have little impact on predicting effort; removing them makes the model simpler and more efficient. In the China dataset, the independent variables ID and Dev.Type do not play any role in determining the value of effort. Consequently, ID and Dev.Type can be left out of the model. We evaluate the models with cross-validation, a standard evaluation method that systematically runs repeated percentage splits. It partitions the dataset into 10 pieces ("folds"), then holds out each piece in turn for testing and trains on the remaining 9 together. This gives 10 evaluation results, which are averaged.
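The 10-fold procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WEKA implementation used in the study: the function names `k_fold_indices` and `cross_validate`, the shuffling seed, and the `fit` and `score` callables are all hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index starting at position j,
    # so fold sizes differ by at most one.
    return [indices[j::k] for j in range(k)]


def cross_validate(xs, ys, fit, score, k=10, seed=0):
    """Hold out each fold once, train on the rest, and average the scores.

    fit(train_x, train_y) must return a callable model: x -> prediction.
    score(actual, predicted) must return a float.
    Both are placeholders for whatever learner and metric are being studied.
    """
    results = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [xs[i] for i in range(len(xs)) if i not in held]
        train_y = [ys[i] for i in range(len(ys)) if i not in held]
        model = fit(train_x, train_y)
        actual = [ys[i] for i in held_out]
        predicted = [model(xs[i]) for i in held_out]
        results.append(score(actual, predicted))
    # The k evaluation results are averaged into one figure.
    return sum(results) / len(results)
```

With 499 projects and k = 10, this split yields nine folds of 50 projects and one fold of 49.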

Feature Selection Method
There are different strategies for reducing data dimensionality. We used the feature subset selection procedure provided in the WEKA tool [29] to reduce the number of independent variables. Applying CfsSubsetEval with the BestFit search method reduces the 19 features to 7. When the Genetic Algorithm is used for feature selection, the 19 features are reduced to 9. The best combination of independent variables was sought by searching through the possible combinations of variables. The dependent variable is effort. Software development effort is defined as the work carried out by the software supplier from specification until delivery, measured in hours.

Performance Measures
The most widely used standard measures of estimation accuracy are the Mean Magnitude of Relative Error (MMRE), also called the mean absolute relative error, and PRED at levels 0.25, 0.50 and 0.75. MMRE is calculated as follows:
$$\mathrm{MMRE} = \frac{1}{k} \sum_{i=1}^{k} \frac{|E_i - A_i|}{A_i}$$
Here, $E_i$ represents the estimated value for data point $i$, $A_i$ represents the actual value of that data point, and $k$ is the total number of data points. PRED(A) is calculated as follows:
$$\mathrm{PRED}(A) = \frac{m}{k}$$
where $m$ is the number of data points whose magnitude of relative error (MRE) is less than or equal to $A$. It is common to use 25% as the reference value [32].
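As a concrete illustration of MMRE and PRED as described above, the following minimal Python sketch computes both from paired lists of actual and estimated effort. The function names and the sample numbers are made up for illustration and are not taken from the China dataset; actual effort values are assumed to be strictly positive.

```python
def mre(actual, estimated):
    """Magnitude of relative error for one project (actual must be > 0)."""
    return abs(estimated - actual) / actual


def mmre(actuals, estimates):
    """Mean of the per-project magnitudes of relative error."""
    errors = [mre(a, e) for a, e in zip(actuals, estimates)]
    return sum(errors) / len(errors)


def pred(actuals, estimates, level=0.25):
    """Fraction of projects whose MRE is less than or equal to `level`."""
    errors = [mre(a, e) for a, e in zip(actuals, estimates)]
    hits = sum(1 for err in errors if err <= level)
    return hits / len(errors)


# Illustrative values only: four hypothetical projects, effort in hours.
actual_effort = [100, 200, 400, 800]
estimated_effort = [110, 150, 400, 1600]
example_mmre = mmre(actual_effort, estimated_effort)
example_pred25 = pred(actual_effort, estimated_effort, 0.25)
```

For these four projects the individual MREs are 0.10, 0.25, 0.00 and 1.00, so MMRE is about 0.3375 and PRED(0.25) is 3 out of 4. A single badly overestimated project dominates MMRE, which is one reason PRED is usually reported alongside it.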

Relative Absolute Error
Relative Absolute Error (RAE) measures the accuracy of a predictive model and is commonly used in machine learning. RAE is expressed as a ratio: the total absolute error of the model is divided by the total absolute error of a trivial (naive) model that always predicts the mean of the actual values. The model is considered non-trivial if the result is less than 1. For a dataset of $n$ data points:
$$\mathrm{RAE} = \frac{\sum_{i=1}^{n} |E_i - D_i|}{\sum_{i=1}^{n} |D_i - \bar{D}|}$$
where the $E_i$ are the predictions, the $D_i$ are the actual values, RAE is the measure of forecast accuracy, $\bar{D}$ is the mean of the $D_i$, and $n$ is the size of the dataset (in data points) [32] [33] [34].

Root Relative Squared Error
Root relative squared error (RRSE) takes the total squared error of the model and normalizes it by the total squared error of a trivial model that always predicts the mean of the actual values. The square root is then taken so that the error has the same dimension as the predicted quantity. RRSE is computed as:
$$\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{n} (E_i - D_i)^2}{\sum_{i=1}^{n} (D_i - \bar{D})^2}}$$
where the $E_i$ are the predictions, the $D_i$ are the actual values, RRSE is the measure of forecast accuracy, $\bar{D}$ is the mean of the $D_i$, and $n$ is the size of the dataset (in data points) [32] [33] [34].
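Both relative measures compare a model against the trivial predictor that always outputs the mean of the actual values. A minimal Python sketch, using the standard definitions of RAE and RRSE and hypothetical function names, could look like this:

```python
import math


def rae(actuals, predictions):
    """Relative absolute error: total |error| relative to the mean predictor."""
    mean_actual = sum(actuals) / len(actuals)
    model_error = sum(abs(p - d) for d, p in zip(actuals, predictions))
    naive_error = sum(abs(d - mean_actual) for d in actuals)
    return model_error / naive_error


def rrse(actuals, predictions):
    """Root relative squared error: squared-error analogue of RAE."""
    mean_actual = sum(actuals) / len(actuals)
    model_error = sum((p - d) ** 2 for d, p in zip(actuals, predictions))
    naive_error = sum((d - mean_actual) ** 2 for d in actuals)
    return math.sqrt(model_error / naive_error)
```

Predicting the mean for every data point gives exactly 1 for both measures, so values below 1 indicate a model that does better than the trivial baseline. Both functions assume the actual values are not all identical, since the denominator would otherwise be zero.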

Machine Learning Techniques
In this study, we use machine learning methods to predict effort. We use the bagging ensemble learning method with the base learners Linear Regression, Support Vector Machine (SMOReg), Neural Network (MLP), M5Rule, REPTree, and Random Forest.

Linear Regression
Linear regression (LR) is widely used for predictive analysis. LR measures the degree to which variables are linearly related. Multiple linear regression (MLR) is an empirical model that uses data from past results in order to estimate current results (Liung and Fan [33]). The formula for multiple linear regression is:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon$$
where $y_i$ is the dependent variable, $x_{i1}, \ldots, x_{ip}$ are the explanatory variables, $\beta_0$ is the y-intercept (constant), $\beta_1, \ldots, \beta_p$ are the slope coefficients for each explanatory variable, and $\epsilon$ is the model's residual (error) term.
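To make the roles of the intercept and slope concrete, the sketch below fits the single-variable special case of the linear model by ordinary least squares. It is a simplification for illustration only: the study uses several explanatory variables, and the function name `fit_simple_linear_regression` and the sample data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = b0 + b1 * x for one explanatory variable.

    Returns (b0, b1): the y-intercept and the slope coefficient.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Illustrative data lying exactly on the line y = 1 + 2x.
b0, b1 = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
```

On this noise-free example the fit recovers an intercept of 1 and a slope of 2; on real project data the residual term absorbs whatever the line cannot explain.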

Multilayer Perceptron (MLP)
A multilayer perceptron (MLP) is used for regression. In an MLP, the input is not fed directly to the output; instead, it passes through one or more intermediate (hidden) layers. Each node applies an activation function to its input: $y(v_i)$ is the output of the $i$th node (neuron), and $v_i$ is the weighted sum of its input connections.
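The relationship between the weighted sum $v_i$ and the node output $y(v_i)$ can be sketched as a forward pass through one hidden layer. The choice of a logistic (sigmoid) activation and a linear output node is an assumption made here for illustration; the text does not state which activation function was used, and all names below are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic activation, assumed here; other activations are possible."""
    return 1.0 / (1.0 + math.exp(-v))


def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer MLP with a single linear output.

    hidden_weights holds one weight list per hidden node, each the same
    length as the input vector x.
    """
    hidden_outputs = []
    for weights, bias in zip(hidden_weights, hidden_biases):
        # v_i: weighted sum of the node's input connections plus its bias.
        v = sum(w * xi for w, xi in zip(weights, x)) + bias
        # y(v_i): output of the node after applying the activation.
        hidden_outputs.append(sigmoid(v))
    # Linear output node, as is usual when the target is a real number.
    return sum(w * h for w, h in zip(output_weights, hidden_outputs)) + output_bias
```

Training, which adjusts the weights by backpropagation, is deliberately left out of this sketch.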

Sequential Minimal Optimization Regression
Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises during the training of support vector machines. Consider a binary classification problem with a dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i$ is an input vector and $y_i$ is a label of either $-1$ or $+1$. A soft-margin support vector machine is trained by solving a quadratic programming problem, which is expressed in the dual form as:
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j K(x_i, x_j) \alpha_i \alpha_j$$
subject to $0 \le \alpha_i \le C$ for $i = 1, \ldots, n$ and $\sum_{i=1}^{n} y_i \alpha_i = 0$. Here $C$ is a support vector machine hyperparameter, $K(x_i, x_j)$ is the kernel function, and the $\alpha_i$ are Lagrange multipliers.

REPTree
REPTree is based on regression tree logic. It builds a number of trees over different iterations and then selects the best one from all the generated trees; the selected tree is taken as the representative. The trees are compared using the mean squared error of their predictions [35].

Decision Tree
A decision tree can be used as a regression method. It provides an easily understandable modelling technique. Moreover, it can tolerate some imperfections in the data, such as missing values, and still predict patterns. A decision tree uses recursive partitioning, continuing until the data in the dataset are classified. Trees can be built in two ways, breadth first (BF) or depth first (DF). In both, a greedy algorithm is typically applied. One drawback of decision trees is overfitting to the data samples [35].

Bagging
Bagging is a method for improving the stability and accuracy of machine learning models. It helps reduce the overfitting found in some regression methods such as decision trees. It works by repeatedly sampling from a dataset with replacement according to a uniform probability distribution [32] [35] [36]. In bagging, each sample has exactly the same size as the original data. Because the sampling is done with replacement, some instances of the dataset may never be selected, while others may be selected multiple times.
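The sampling scheme described above, followed by averaging the predictions of the individual models, can be sketched as follows. This is a minimal illustration rather than the WEKA Bagging implementation used in the study; the function names, the default of 10 models and the `fit` callable (any base learner that returns a prediction function) are assumptions.

```python
import random


def bootstrap_sample(n, rng):
    """Draw n indices uniformly with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


def bagging_fit(xs, ys, fit, n_models=10, seed=0):
    """Train one base model per bootstrap sample of the training data.

    fit(train_x, train_y) must return a callable model: x -> prediction.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = bootstrap_sample(len(xs), rng)
        # Each bootstrap sample has the same size as the original data.
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """For regression, the ensemble prediction is the average over models."""
    return sum(model(x) for model in models) / len(models)
```

Because each draw is made with replacement, a bootstrap sample of a large dataset contains on average only about 63% of the distinct original instances; the rest are left out of that particular sample while others appear more than once.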

Random Decision Forest (RDF)
Random decision forest (RDF) was first introduced by Ho [35]. It corrects the problem of overfitting found in decision trees.
RDF is a learning scheme for regression in which multiple decision trees are built during training and their predictions are averaged to give the final prediction. The training process for random forests uses the bagging technique.

Experiment Design
From the literature and from our experiments, we found that most researchers neglect pre-processing steps. Missing and noisy data should be removed from the dataset before modelling. In addition, attribute selection is another important step that is often neglected; it directly affects memory use and the results. To overcome these limitations, we follow these basic steps (see Figure 1).
○ Input the dataset
○ Preprocess the data by filtering out noisy and missing data,

Results
Figure 2(a) shows the predicted effort compared to the actual effort on the China dataset using the BestFit algorithm and Random Forest bagging. The results show a strong correlation between predicted and actual effort. Figure 2(b) shows the same comparison for REPTree. The results are shown in Figure 4, Table 2 and Table 3. The effort estimation model built using Bagging M5Rule with Genetic Algorithm feature selection outperformed the other methods among the 12 examined in our study. This shows that Bagging M5Rule gives very good results for software effort prediction. The MMRE is 10%, while Pred (25), Pred (50) and Pred (75) are 97%, 98% and 99%, respectively. This shows that the performance of Bagging M5Rule is excellent, even when we compare our results with existing research. Table 2 also shows that the performance of ensemble learning is the best among all. Table 3 shows that the correlation between actual and predicted values is also very high. In the charts in Figure 5 and Figure 6, the actual values and the values predicted by the specific model appear on the Y-axis and correspond to the 499 projects. The "blue" band displays the curve for the actual values, while the "red" band presents the curve for the predicted values. The closer the actual and predicted bands, the lower the error and the better the model. The charts show that the actual and predicted values are very close to one another.
Answer to Research Questions
RQ1. Impact of feature selection algorithm on different ensemble learning algorithms.
In Table 2 and Table 3, when BestFit feature selection is used, the MMRE is at its minimum in all cases and the values of PRED 0.25, 0.50 and 0.75 are high, which shows that all algorithms perform best with this method. When the genetic algorithm is used for feature selection, the MMRE increases and the values of PRED 0.25, 0.50 and 0.75 decrease. The impact of feature selection is therefore very high in this study.
RQ2. What is the performance of ensemble Learning Techniques?
The performance of Bagging M5Rule is the best among all the algorithms, in both feature selection cases; in addition, its actual and predicted values are highly correlated in the comparative analysis in Table 3.

Conclusion
In this study we performed a comparative analysis of twelve ensemble methods for predicting effort. We used a dataset from the Promise repository. The dataset contains nineteen (19) features, so we applied two feature selection methods, BestFit and Genetic Algorithm, to six ensemble learning algorithms. The results show that Bagging M5Rule with Genetic Algorithm feature selection is the best method for predicting effort, with an MMRE of 10% and PRED (25), PRED (50) and PRED (75) values of 97%, 98% and 99%, respectively. Hence, the ensemble learning method shows the ability to predict effort. In the future, researchers can use ensemble learning with different feature selection methods for effort estimation.