
Software Cost Estimation (SCE) is an essential requirement in producing software today. Genuinely accurate estimation must account for the cost and effort factors involved in delivering software, using algorithmic models or Ensemble Learning Methods (ELMs). Effort is estimated in terms of person-months and duration. Both overestimation and underestimation of effort can adversely affect software development; hence, it is the responsibility of software development managers to estimate cost using the best available techniques. The predominant cost of any software product is the cost of the effort expended, so effort estimation is pivotal and there is a constant need to improve its accuracy. Several effort estimation models are available; however, it is difficult to determine which model is more accurate on which dataset. Hence, we use ensemble learning bagging with the base learners Linear Regression, SMOReg, MLP, Random Forest, REPTree, and M5Rule. We also examine the effect of two feature selection algorithms, BestFit and the Genetic Algorithm. The dataset, known as China, comprises 499 projects. The results show that the Mean Magnitude of Relative Error of bagging M5Rule with Genetic Algorithm feature selection is 10%, which makes it better than the other algorithms.

For software developers the quality of a software product is vital, and software cost estimation helps developers maintain good quality. Estimating software cost, in terms of person-months and the time to complete the project, is crucial. Although software cost estimation plays a vital role in software development, there have been only minor advances in this area over the last few decades. The most common reason for project failure is poor cost estimation. Even though many effort models are available, novel methods for improving estimation accuracy are still needed, which motivates the development of a software effort prediction model that estimates effort as accurately as possible. Software cost estimation predictions are used to forecast the cost of software. Machine learning methods use historical datasets to predict the actual cost of future software. The fundamental purpose of using machine learning is to learn the inherent patterns in feature values and their relations with project effort, and to predict the effort for new software projects.

ML approaches have been used as a complement to both expert judgment and algorithmic models over the past decade. These methodologies include Artificial Neural Networks (ANN), fuzzy logic, bagging, boosting, decision trees, Support Vector Machines (SVM), and so on. The advantage of these approaches is that they can model the complex relationship between effort and the independent variables, and they suit difficult problems where an outcome must be learned from historical data. Many machine learning approaches have been reported in the literature, though it is very difficult to say which approach is better.

Software effort estimation plays a vital role in calculating the cost of developing a software project. Understanding and controlling the basic factors that influence software cost is a fundamental task in software project management. Software metrics are measures of software products and their qualities. Since such measurements are central to software engineering, there have been numerous investigations over the last four decades aimed at giving a thorough view of software's complex nature and at using metrics in software cost estimation and software analysis. The first software metrics book was published in 1976 [

In the mid-1970s, interest in software design complexity grew when graph-theoretic complexity was discussed by McCabe in [

Fei, Zhi and Chao [

As per Smith, Hale and Parish [

Tosun, Turhan and Bener [

As we have seen, software repositories or datasets are generally used to obtain the data on which effort estimation is performed. Yet software repositories contain data from heterogeneous projects, and the conventional use of regression equations to derive a single mathematical model results in poor performance [

RQ1. What is the impact of the BestFit and Genetic Algorithm feature selection methods on software effort prediction on two datasets, when performance is measured using four metrics: MMRE (mean magnitude of relative error) and prediction at levels 0.25, 0.50 and 0.75, respectively?

RQ2. What is the performance of the ensemble learning techniques? We determine which ML methods give the best and worst outcomes for each dataset explored in the investigation.

In this study we have taken the China dataset for software cost estimation from the Promise Data repository [

Serial Number | Variable | Min | Max | Mean | Standard Deviation |
---|---|---|---|---|---|
1. | ID | 1 | 499 | 250 | 144 |
2. | AFP | 9 | 17518 | 487 | 1059 |
3. | Input | 0 | 9404 | 167 | 486 |
4. | Output | 0 | 2455 | 114 | 221 |
5. | Enquiry | 0 | 952 | 62 | 105 |
6. | File | 0 | 2955 | 91 | 210 |
7. | Interface | 0 | 1572 | 24 | 85 |
8. | Added | 0 | 13580 | 360 | 830 |
9. | Changed | 0 | 5193 | 85 | 291 |
10. | Deleted | 0 | 2657 | 12 | 124 |
11. | PDR_AFP | 0.3 | 83.8 | 12 | 12 |
12. | NPDR_AFP | 0.4 | 101 | 13 | 14 |
13. | NPDU_UFP | 0.4 | 108 | 14 | 15 |
14. | Resource | 1 | 4 | 1 | 1 |
15. | PDR_UFP | 0.3 | 83.8 | 12 | 12 |
16. | Dev. Type | 0 | 0 | 0 | 0 |
17. | Duration | 1 | 84 | 9 | 7 |
18. | N_effort | 31 | 54620 | 4278 | 7071 |
19. | Efforts | 26 | 54260 | 3921 | 6481 |

effort, consequently making the model simpler and more efficient. It can be seen from the China dataset that the independent variables ID and Dev.Type play no role in determining the value of effort; consequently, these variables can be dropped. Here we perform cross-validation, a standard evaluation method that is a systematic way of running repeated percentage splits: the dataset is partitioned into 10 pieces ("folds"), each piece is held out in turn for testing while the model is trained on the remaining 9, and the resulting 10 evaluation results are averaged.
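The 10-fold procedure described above can be sketched as follows. This uses scikit-learn rather than the WEKA tool employed in the study, and the data is synthetic, so the model choice and feature count are illustrative assumptions only.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the China dataset (the real data has 19 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    # Train on the 9 remaining folds, test on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_errors.append(np.mean(np.abs(pred - y[test_idx])))

# The reported score is the average over the 10 folds.
cv_mae = float(np.mean(fold_errors))
```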

There are different techniques for reducing data dimensionality. We have used the feature sub-selection procedure provided in the WEKA tool [

Mean Magnitude of Relative Error (MMRE) (or mean absolute relative error) is currently among the most widely used standard measures of estimation accuracy; we use MMRE together with PRED at levels 0.25, 0.50 and 0.75, respectively.

In order to assess capability, we use a common criterion called Mean Magnitude of Relative Error (MMRE) [

\mathrm{MMRE} = \frac{1}{k} \sum_{i=1}^{k} \frac{|E_i - A_i|}{A_i} \qquad (1)

here, E_{i} represents the estimated value for a data point, A_{i} represents the actual value of each data point, and k is the total number of data points. Here, Predict(A) is calculated as follows:

\mathrm{Predict}(A) = \frac{m}{k} \qquad (2)

where m is the number of data points whose magnitude of relative error (MRE) is less than or equal to A, and k is the total number of data points. It is common to use A = 25% as the reference value [
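As a concrete illustration, Equations (1) and (2) can be computed directly; the sample actual/estimated values below are invented for illustration only.

```python
import numpy as np

def mmre(actual, estimated):
    """Mean Magnitude of Relative Error, Equation (1)."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.mean(np.abs(estimated - actual) / actual))

def pred(actual, estimated, a=0.25):
    """PRED(A), Equation (2): fraction of points whose MRE is <= A."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    mre = np.abs(estimated - actual) / actual
    return float(np.mean(mre <= a))

actual = [100.0, 200.0, 400.0]
estimated = [110.0, 150.0, 400.0]
# Individual MREs are 0.10, 0.25 and 0.00, so MMRE = 0.35/3 and PRED(25) = 1.0.
```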

Relative Absolute Error (RAE) measures the accuracy of a predictive model and is commonly used in machine learning. RAE is expressed as a ratio: the total absolute error of the model divided by the total absolute error of a trivial (naive) model that always predicts the mean. The model outperforms the naive baseline if the result is less than 1. For a model k over a dataset:

R_k = \frac{\sum_{i=1}^{n} |E_{ki} - D_i|}{\sum_{i=1}^{n} |D_i - \bar{D}|} \qquad (3)

where E_{ki} are the predictions of model k, D_i are the actual values, R_k is the measure of forecast accuracy, \bar{D} is the mean of the D_i, and n is the size of the dataset (in data points) [
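Equation (3) can be sketched in a few lines; the sample values used in the test are invented for illustration.

```python
import numpy as np

def rae(actual, predicted):
    """Relative Absolute Error, Equation (3): the model's total absolute
    error divided by that of a naive model always predicting the mean."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    naive = np.sum(np.abs(actual - actual.mean()))
    return float(np.sum(np.abs(predicted - actual)) / naive)
```

A perfect predictor yields 0, and any value below 1 means the model beats the mean-only baseline.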

Root Relative Squared Error (RRSE) squares the individual errors and normalizes their sum by the squared error of a naive model predicting the mean; the square root is then taken to bring the error back to the same dimension as the predicted quantity. In the following equation the prediction E_{kj} of an individual model k is used:

R_k = \sqrt{\frac{\sum_{j=1}^{n} (E_{kj} - D_j)^2}{\sum_{j=1}^{n} (D_j - \bar{D})^2}} \qquad (4)

where E_{kj} are the predictions of model k, D_j are the actual values, R_k is the measure of forecast accuracy, \bar{D} is the mean of the D_j, and n is the size of the dataset (in data points) [
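Equation (4) admits the same kind of sketch. Note the two useful reference points: a perfect predictor scores 0, and a model that always predicts the mean scores exactly 1.

```python
import numpy as np

def rrse(actual, predicted):
    """Root Relative Squared Error, Equation (4)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    naive = np.sum((actual - actual.mean()) ** 2)
    return float(np.sqrt(np.sum((predicted - actual) ** 2) / naive))
```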


We use machine learning methods to predict effort: the ensemble learning method bagging with the base learners Linear Regression, Support Vector Machine (SMOReg), Neural Network (MLP), M5Rules, REPTree, and Random Forest.

Linear regression (LR) is widely used for predictive analysis. Basically, LR measures the degree to which variables are linearly related. The formula for linear regression is:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \qquad (7)

Furthermore, multiple linear regression (MLR) extends this to several explanatory variables. According to Liung and Fan, MLR is an empirical model that utilizes data from past results in order to estimate current results [

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_n x_{in} + \epsilon \qquad (8)

where y_i is the dependent variable, the x_{ij} are the explanatory variables, \beta_0 is the y-intercept (constant), \beta_j is the slope coefficient for each explanatory variable, and \epsilon is the model's residual (error).
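A minimal sketch of fitting Equation (8) by ordinary least squares follows; the synthetic data and coefficient values are illustrative assumptions.

```python
import numpy as np

# Generate data from a known linear model y = b0 + b1*x1 + b2*x2 + b3*x3 + eps.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_beta = np.array([5.0, 2.0, -1.0, 0.5])          # beta_0 .. beta_3
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.01, size=200)

# Prepend a column of ones so the intercept beta_0 is estimated too.
design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
```

With low noise, the estimated coefficients recover the true ones closely.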

A multilayer perceptron (MLP) can also be used for regression; rather than feeding the inputs directly to the output, an intermediate (hidden) layer is used. An MLP has several layers of nodes, at least three: an input layer, a hidden layer and an output layer. The depth of the model varies with the number of intermediate layers. Common node activations are y(v_i) = \tanh(v_i) and the logistic function y(v_i) = (1 + e^{-v_i})^{-1}, where y(v_i) is the output of the i-th node (neuron) and v_i is the weighted sum of its input connections.
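A minimal MLP regression sketch using scikit-learn is shown below; the layer size, solver and synthetic target function are illustrative assumptions, not the configuration used in the study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic nonlinear target: a single hidden layer of tanh units can fit it.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.tanh(X[:, 0]) + X[:, 1] ** 2

mlp = MLPRegressor(hidden_layer_sizes=(16,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=0)
mlp.fit(X, y)
score = mlp.score(X, y)   # R^2 on the training data
```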

Sequential minimal optimization (SMO) solves the quadratic programming (QP) problem that arises when training support vector machines. Consider a binary classification problem with a dataset of pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_i is an input vector and y_i is a label of either −1 or +1. A soft-margin support vector machine is trained by solving a quadratic programming problem of the form:

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j K(x_i, x_j) \alpha_i \alpha_j \qquad (9)

where K(x_i, x_j) is the kernel function and the \alpha_i are Lagrange multipliers (Platt, 1998), subject to the two conditions 0 \le \alpha_i \le C (for i = 1, 2, \ldots, n) and \sum_{i=1}^{n} y_i \alpha_i = 0, where C is a support vector machine hyperparameter.
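WEKA's SMOReg applies an SMO-style solver to the regression variant of this QP. As a rough analog, scikit-learn's SVR (whose underlying libsvm solver is SMO-based) can be sketched as follows; the RBF kernel, C, epsilon and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

# Fit support vector regression to a smooth one-dimensional function.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel()

svr = SVR(kernel="rbf", C=10.0, epsilon=0.01)
svr.fit(X, y)
score = svr.score(X, y)   # R^2 on the training data
```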

REPTree is based on regression tree logic. It creates a number of trees across different iterations and then selects the best one; this tree is taken as representative. The mean square error is used to evaluate the trees' predictions [

A decision tree is a regression methodology that provides an easily understandable modelling technique. Even when the data has imperfections, such as missing values, it can learn patterns that overcome such issues. Decision tree induction recursively partitions the dataset until a classification is reached; traversal can be breadth-first (BF) or depth-first (DF), and in both cases a greedy algorithm is typically applied. One drawback of decision trees is overfitting of the data samples [
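The overfitting drawback is easy to demonstrate: an unconstrained regression tree memorizes its training data. The data below is synthetic and the setup is an illustrative assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy linear data: a perfect training fit here means fitting the noise too.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=150)

# With no depth or leaf-size limits, the tree partitions until every
# training point sits in its own leaf.
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
train_r2 = deep.score(X, y)
```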

Bagging is a method to stabilize the accuracy of machine learning models. It helps reduce the overfitting found in some regression methodologies such as decision trees. Bagging repeatedly samples from a dataset according to a uniform probability distribution with replacement [
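The stabilizing effect can be sketched by comparing a single tree with a bagged ensemble of trees under cross-validation; the data and the number of estimators are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=300)

# Single tree vs. 50 trees each trained on a bootstrap sample (sampling
# with replacement); the bagged predictions are averaged.
tree_cv = cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=5).mean()
bag = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                       n_estimators=50, random_state=0)
bag_cv = cross_val_score(bag, X, y, cv=5).mean()
```

Averaging over bootstrap replicates reduces the variance of the individual trees, so the bagged score is typically higher.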

Random decision forest (RDF) was first introduced by Ho [

In the literature we have found some limitations, and from our experiment we discovered that most researchers skip steps in their pre-processing: missing and noisy data should be removed from the dataset before modelling. Besides this, attribute selection is another important concern, as it directly affects memory use as well as the results. To overcome these limitations we follow these basic steps (see

○ Input as a dataset

○ Preprocess the data:

• by filtering out noisy and missing data,

• by conversion,

• by removing outliers.

○ Apply Feature Selection Method (CfsSubsetEval: 1) Genetic Algorithm, 2) BestFit)

○ Ensemble Learning Methods

○ Computing the results
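The steps above can be sketched end-to-end. Here SelectKBest stands in for WEKA's CfsSubsetEval search (neither the Genetic Algorithm nor BestFit search is part of scikit-learn's standard API), the data is synthetic, and k=3 is an illustrative assumption.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Input: synthetic dataset with 8 features, only 3 of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 50.0 + 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)

# 1) Preprocess: drop rows containing missing values (none here by design).
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# 2) Feature selection (stand-in for CfsSubsetEval + GA/BestFit search).
X_sel = SelectKBest(f_regression, k=3).fit_transform(X, y)

# 3) Ensemble learning, and 4) computing the results via 10-fold MMRE.
bag = BaggingRegressor(LinearRegression(), n_estimators=30, random_state=0)
pred = cross_val_predict(bag, X_sel, y, cv=10)
mmre = float(np.mean(np.abs(pred - y) / y))
```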

Figures 3(a)-(f) show the same prediction experiment with the same six techniques when the Genetic Algorithm is used for feature selection.

The results are shown in

Feature Selection Algorithm | ML Algorithm | MMRE | PRED 25 | PRED 50 | PRED 75 |
---|---|---|---|---|---|
BestFit | Bagging LR | 0.147558 | 0.88 | 0.946667 | 0.98 |
 | Bagging SMOReg | 0.126655 | 0.911824 | 0.963928 | 0.983968 |
 | Bagging MLP | 0.176172 | 0.8 | 0.906667 | 0.96 |
 | Bagging M5 Rule | 0.10263 | 0.97333 | 0.98 | 0.99333 |
 | Bagging REPTree | 0.153349 | 0.88 | 0.95333 | 0.98 |
 | Bagging RF | 0.251015 | 0.78 | 0.88 | 0.93333 |
Genetic Algorithm | Bagging RF | 0.318897 | 0.74 | 0.82 | 0.9 |
 | Bagging REPTree | 0.152987 | 0.88 | 0.95333 | 0.98 |
 | Bagging M5 Rule | 0.10006 | 0.97333 | 0.98 | 0.99333 |
 | Bagging LR | 0.193831 | 0.8133 | 0.9133 | 0.946667 |
 | Bagging MLP | 0.157302 | 0.8733 | 0.966667 | 0.966667 |
 | Bagging SMOReg | 0.128885 | 0.906667 | 0.966667 | 0.973333 |

Results | Algorithms | MMRE (Genetic Algorithm) | MMRE (BestFit) | PRED (25) GA / PRED (25) BF |
---|---|---|---|---|
Our | Bagging RF | 0.251015 | 0.147558 | 0.8133/0.78 |
 | Bagging REPTree | 0.153349 | 0.126655 | 0.88/0.88 |
 | Bagging M5 Rule | 0.10263 | 0.176172 | 0.9733/0.9733 |
 | Bagging LR | 0.14755777 | 0.10263 | 0.8133/0.88 |
 | Bagging MLP | 0.176172 | 0.153349 | 0.87/0.80 |
 | Bagging SMOReg | 0.126655 | 0.251015 | 0.9066/0.9118 |
[ | Augmented COCOMO | 0.65 | 0.65 | Pred (20) 0.3167 |
 | Parsimonious COCOMO | 0.64 | 0.304 | |
[ | Clustering | 0.0103 | 0.0103 | Pred (30) 0.356 |
[ | Regressive | 0.623 | 0.623 | - |
 | ANN | 0.352 | 0.352 | - |
 | Case Based Reasoning | 0.362 | 0.362 | - |
[ | SVR | 0.165 | 0.165 | 0.8889 |
 | RBF | 0.1907 | 0.1906 | 0.7222 |
 | Linear Regression | 0.233 | 0.233 | 0.7222 |
[ | ANN | 0.900 | 0.900 | 0.22 |
 | Classification and Regression Tree | 0.770 | 0.770 | 0.26 |
 | Ordinary Least Square Regression | 0.720 | 0.720 | 0.33 |
 | Adjusted analogy-based estimation using Euclidean distance | 0.3800 | 0.3800 | 0.57 |
 | Adjusted analogy-based estimation using Manhattan distance | 0.360 | 0.360 | 0.52 |
 | Adjusted analogy-based estimation using Minkowski distance | 0.430 | 0.430 | 0.61 |

predicted bands, the lower the error and the better the model. The charts show that the actual and predicted values are very close to one another.

Answer to Research Questions

RQ1: Impact of feature selection algorithm on different ensemble learning algorithms.

If we look at

RQ2. What is the performance of the ensemble learning techniques?

The performance of bagging M5Rule is the best among all the algorithms, for both feature selection cases; at the same time, the results are highly correlated with the comparative analysis available in

In this study we performed a comparative analysis among twelve ensemble methods for predicting effort, using a dataset from the Promise repository. The dataset contains nineteen (19) features, so we applied two feature selection methods, BestFit and Genetic Algorithm, to six ensemble learning algorithms. The results show that Genetic Algorithm feature selection with bagging M5Rule is the best method for predicting effort, with an MMRE value of 10%, and PRED (25), PRED (50) and PRED (75) values of 97%, 98% and 99%, respectively. In the future, researchers can use ensemble learning with different feature selection methods for effort estimation. Hence, the ensemble learning method shows strong ability for predicting effort.

The authors declare no conflicts of interest regarding the publication of this paper.

Alhazmi, O.H. and Khan, M.Z. (2020) Software Effort Prediction Using Ensemble Learning Methods. Journal of Software Engineering and Applications, 13, 143-160. https://doi.org/10.4236/jsea.2020.137010