A Neuro-Based Software Fault Prediction with Box-Cox Power Transformation

Software fault prediction is one of the most fundamental but significant management techniques in software dependability assessment. In this paper we concern the software fault prediction using a multilayer-perceptron neural network, where the underlying software fault count data are transformed to the Gaussian data, by means of the well-known Box-Cox power transformation. More specially, we investigate the long-term behavior of software fault counts by the neural network, and perform the multi-stage look ahead prediction of the cumulative number of software faults detected in the future software testing. In numerical examples with two actual software fault data sets, we compare our neural network approach with the existing software reliability growth models based on nonhomogeneous Poisson process, in terms of predictive performance with average relative error, and show that the data transformation employed in this paper leads to an improvement in prediction accuracy.


Introduction
As software systems play an increasingly important role in our lives, their complexity and size continue growing.The increased complexity of software systems makes the assurance of software quality much difficult.In fact many of software applications require critical functionality because of their increasing size and complexity.Software reliability is an important facet of software quality, and is defined as the probability of failure-free software operation for a specified period of time in a specified environment.For the purpose of quantitative assessment, software reliability growth models (SRGMs) have been widely used during the last four decades [1] [2] [3] [4] [5].It is worth mentioning that SRGMs specify any parametric form of random processes that describe the behavior of software failure occurrence with respect to time.Since SRGMs are essentially stochastic models with abstractions, they must be built under several simplified mathematical assumptions, and, at the same time, their parameter estimation with the fault count data observed in software testing is not a trivial task.Because the maximum likelihood estimation, which is commonly used, is reduced to a multimodal nonlinear optimization problem with constraints, and requires more computation efforts.Another problem on SRGMs is the model selection from a great number of SRGM candidates.During the last four decades, over three hundred SRGMs have been proposed in the literature.The conclusion from the empirical research suggests that the best SRGM in terms of the goodness-of-fit performance depends on the kind of software fault count data.In other words, there does not exist the best SRGM which can fit every software fault count data.Sharma et al. [6] proposed a selection method of the best SRGM by using the concept of distance.Unfortunately, it is noted that the best SRGM which can fit the past observation experienced before does not always provide the best prediction model for the future (or remaining) testing period.
Apart from SRGMs based on stochastic modeling, artificial neural network (ANN) has gained much popularity to deal with non-linear phenomena arising in applications to time series forecasting, pattern recognition, function approximation, etc. Comparing with non-trivial stochastic models, it is easy to implement the ANN for the software fault prediction, since the feed-forward backpropagation (BP) type of learning algorithm can be widely used to estimate the internal parameters, such as connection weights.Software fault prediction using the ANN was proposed first by Karunanithi et al. [7], Karunanithi and Malaiya [8] [9].They applied simple multi-layered perceptron (MLP) feed forward neural networks, which enjoy a universal approximation ability [10] to represent an arbitrary nonlinear mapping with any degree of accuracy, in the prediction of software fault-detection time in software testing.Since their seminal contributions, the ANN approach has been frequently applied to different estimation/ prediction problems in software engineering.Khoshgoftaa et al. [11] [12], Khoshgoftaar and Szabo [13] considered different problems to identify faultprone modules in software quality assessment.For more recent survey, see Vashisht et al. [14].
While, it should be pointed out that the ANN approach has some drawbacks in application to software fault prediction.First, there is no efficient way to determine the best neural network architecture in each application domain.Even though the number of input neurons and output neurons may be determined from physical requirements, the number of hidden layers and hidden neurons significantly influences the prediction performance in MLP feed forward neural networks.In many applications, these sizes must be determined through trialand-error heuristics in pre-experiments.In other words, the predictive performance of software fault count strongly depends on the ANN architecture assumed for the analysis.Second, the ANN is a kind of nonlinear regression mod-el, but can be regarded as a deterministic model to output the deterministic values as estimates or predictions.Dissimilar to the familiar SRGMs, it is impossible to quantify the software reliability as a probability by applying the common ANN approach.This feature penalizes us to use the ANN when quantifying the software reliability measures such as the software reliability, mean time to software failure, etc.On the other hand, though the ANN is a simple connectionist model depending on the architecture, it can be considered as a statistically nonparametric model without specific model assumptions.As mentioned above, a huge number of SRGMs have been developed in the literature [15]- [25], but almost all of them are based on some parametric assumptions which cannot be validated for every fault count data.In that sense, the ANN approach can be viewed as one of nonparametric models with no specific model assumptions.In fact, the ANN approach provides a data-driven modeling framework and can bridge several kinds of machine learning techniques.Yang et al. [26] applied a model mining technique to provide a generic data-driven SRGM.Xiao and Dohi [27] proposed a nonparametric wavelet method to estimate nonhomogeneous Poisson process (NHPP)-based SRGM.Cheng et al. [28] considered a multistepahead a time series prediction with the MLP approach.Park et al. [29] also compared several data-driven approaches with the existing SRGMs.Recently, Begum and Dohi [30] [31] applied an idea on multiple-input multiple-output (MIMO) neural computation by Park et al. [29], and proposed a refined neural network approach to predict the long-term behavior of software fault count with the grouped data.They impose an assumption that the underlying fault count process obeys the Poisson law with an unknown mean value function, and propose to utilize several data transform methods from the Poisson count data to the Gaussian data.However, it is worth noting that even the Poisson assumption can be regarded as a simplified assumption in SRGM modeling and has not been validated empirically.
In this paper we refine the existing MLP approach as a data-driven model from the view point of software fault prediction.In particular, we deal with the grouped data which consists of the number of software fault counts detected at each testing date, although the existing MLP approaches focus on only the software fault-detection time data which are not easily available in actual software testing.We apply the well-known data transformation technique called the Box-Cox power transformation [32], from an arbitrary probability law to the Gaussian law, and transform the underlying software fault prediction problem to a nonlinear Gaussian regression problem with the MLP.First, Tukey [33] introduced a family of power transformations such that the transformed values obey a monotonic function of the observations over some admissible range, where an arbitrary transformation parameter is involved in the power transformations.[32] proposed the maximum likelihood as well as the Bayesian methods for estimation of the parameter, and derived the asymptotic distribution of the likelihood ratio to test some hypotheses about the parameter.The main contributions by Box and Cox [32] are two-folds: The first one is to com-pute the profile likelihood function and to obtain an approximate confidence interval from the asymptotic property of the maximum likelihood estimator.The second one is to ensure that the probability model is fully identifiable in the Bayesian approach.In neural computation contexts, Dashora et al. [34]  [39] [40] just considered the one-stage look ahead prediction.Hence, our challenge with MIMO and the Box-Cox power transformation overcomes the limitation for the existing neuro-based approaches.Recently, the same authors [30] considered a different transformation technique for the software fault prediction, where only the one-stage look ahead prediction is made in terms of prediction interval.Also, though they implicitly assume the Poisson law for the underlying fault count data, we do not restrict the Poisson law for the software fault count data, because the Box-Cox transform is a general data transformation scheme.The present paper is organized as follows.In Section 2, we overview the SRGMs based on nonhomogeneous Poisson process (NHPP).Section 3 introduces a refined ANN for the purpose of prediction, where the Box-Cox power transformation is applied in the pre-processing of input data.In Section 4, we analyze two actual software fault count data sets and compare our refined neural network approach with eleven NHPP-based SRGMs [41] from the view point of predictive performance with relative average error.We show here that our ANN approach affords a more appropriate prediction device and tends to have an enhanced performance from the standpoint of predictability.Finally the paper is concluded with remarks in Section 5.

NHPP-Based Software Reliability Modeling
We summarize the software reliability growth modeling with the nonhomogeneous Poisson process (NHPP).Suppose that software system test starts at time 0 t = .Let ( ) X t denote the cumulative number of software faults detected by time t, where ( )

{ }
, 0 X t t ≥ means a stochastic (non-decreasing) counting process in continuous time.In particular, it is said that ( ) X t is an NHPP if the following conditions hold: • ( )  Suppose that n realizations of are observed up to the observation point t.We estimate the model parameter θ by means of the maximum likelihood method.Then, the log likelihood function for the grouped data ( )( ) where ( )  ( ) ( ) ( ) ( ) ( ) ( ) Log logistic (llogist) [19] ( ) ( ) Truncated extreme value maximum (txvmax) [22] ( ) ( ) ( ) ( ) Log extreme value maximum (lxvmax) [22] ( ) ( ) Truncated extreme value minimum(txvmin) [22] ( ) ( ) ( ) ( ) Log extreme value minimum (lxvmin) [18] [22] ( ) ( ) ( ) When the mean value function is unknown, a few nonparametric approaches have been developed [43] [44].However, it should be noted that those approaches can deal with the fault-detection time data, but do not work for prediction in future.The wavelet-based method in [27] can treat the grouped data, but fails to make the long-term prediction in nature.In the following section, we use an elementary MLP for the purpose of the long-term software fault prediction.

A Refined MLP Architecture
Artificial neural network is a computational metaphor inspired by the brain and nervous system study, and consists of an input layer with some inputs, multiple hidden layers with hidden neurons and one output layer.The input layer of neurons can be used to capture the inputs from the outside world.Since the hidden layer of neurons has no communication with the external world, the output layer of neurons sends the final output to the external world.Hence, determining an appropriate number of hidden neurons is an important design issue in neural computation.In this paper we consider an MIMO type of MLP with only one hidden layer.Similar to Section 2, suppose that n software fault count data ( )( ) are observed at the observation point ( ) Our concern is about the future prediction of the cumulative number of software faults at time ( )

Preliminary Set-Up
In the common neural computation, it is noted that the neural network including the simplest MLP with only one output neuron is regarded as a nonlinear regression model, where the explanatory variables are randomized by the Gaussian white nose.In other words, the output data in the MLP is implicitly assumed to be a realization of a nonlinear Gaussian model.On the other hand, since one handles the fault count data as integer values in the software fault prediction, the underlying data shall be transformed to the Gaussian data in advance.Such a pre-data processing is common in the wavelet shrinkage estimation [27] although it specifies the underlying data as the Poisson data.According to the idea in the literature [27], we apply the Box-Cox power transformation technique [32] from an arbitrary random data to the Gaussian data.As mentioned in Section 1, Box and Cox [32] developed a procedure to identify an appropriate exponent λ to transform data into a "normal shape".Table 2 presents the Box-Cox power transformation and its inverse transform formula.In this table, i x denotes the cumulative number of software faults detected at ( ) the power to which all data should be raised, where the parameter λ has to be adjusted in the Box-Cox power transformation.It is common to determine the optimal λ in the pre-experiments before time series prediction.

Let
( ) ( ) x l + = be the input and output for the MIMO type of MLP, respectively.Then, the prediction of the cumulative number of software faults is given by the inversion of the data transform.Figure 1 depicts the architecture of back propagation type MIMO, where n is the num- ber of software fault count data experienced before the observation point n t and l is the prediction length.We suppose that there is only one hidden layer with ( ) hidden neurons in our MIMO type of MLP.

Training Phase
Suppose that all the connection weights ( nk weights from input to hidden layer, kl weights from hidden to output layer in Figure 1) are first given by the uniformly distributed pseudo random varieties.In the MIMO, if these weights are completely known, then it is possible to calculate ( ) from the input ( ) 1 , , n x x directly.However, since it is impossible to train all the weights in- cluding ( ) k n l + unknown patterns in principle via the common BP algorithm, it is needed to develop a new long-term prediction scheme for the MIMO.In short, we briefly introduce the long-term prediction scheme developed by Begum and Dohi [31].Suppose that n l > without any loss of generality.In Fig- ure 2, we illustrate the configuration of our prediction scheme.In order to predict the cumulative number of software faults for l testing days from the observation point n t , the prediction has to be made at the point n l t − .This implies that only ( ) n l k kl nl − + = weights can be estimated with the training data ex- perienced for the period ( ]   .
Since there is no universal method to determine the number of hidden neurons, we change k in the pre-experiments and choose an appropriate value.
After calculating j h for each hidden neuron, we apply a sigmoid function Because s x are also summative and weighted inputs from respective hidden neurons in the output layer, the weight js w′ is connected from j-th hidden neuron to s-th output neuron.The output value of the network in the training phase, s x is calculated by ( ) ( ) − .In the BP algorithm, the error is propagated from an output layer to a successive hidden layer by updating the weights, where the error function is defined by with the prediction value s x and the teaching signal o s x observed for the pe- riod ( ] x where each connection weight is adjusted using the gradient descent algorithm according to the contribution to SSE in Equation ( 8).
The momentum, α, and the learning rate, η, are controlled to adjust the weights and the convergence speed in the BP algorithm, respectively.Since these are the most important tuning parameters in the BP algorithm, we carefully examine these parameters in pre-experiments.In this paper we set α = 0.25 ~ 0.90 and η = 0.001 ~ 0.500.Then, the connection weights are updated in the following: respectively.Also, the updated bias weights for hidden and output neurons are respectively given by ( ) ( ) The above procedure is repeated until the desired output is achieved.

Prediction Phase
Once the nl weights are estimated with the training data experienced for the period ( ] , are calculated by replacing Equations ( 6) and ( 7) by , where m = 1000 is confirmed to be enough in our preliminary experiments.In other words, the prediction in the MIMO type of MLP is reduced to a combination of the BP learning and a Monte Carlo simulation on the connection weights.

Data Sets
We use two real project data sets cited in the reference [2]; DS1 and DS2, which consist of the software fault count (grouped) data.In these data sets, the length of software testing and the total number of detected software faults are given by (62, 133) and (41,351), respectively.To find out the desired output via the BP algorithm, we need much computation cost to calculate the gradient descent, where the initial guess of weights, ij w , js w′ , 0 j w and 0s w′ , are given by the uniform random variates ranged in [ ] 1, 1 − + , the number of total iterations in the BP algorithm run is 1000 and the convergence criteria on the minimum error is 0.001 which is same as our previous paper [30].In Figure 3 and Figure 4, we give two examples on how to determine the optimal transformation parameter * λ .In our experiments, it is shown that the search range of λ should be [ ]

Predictive Performance Criterion
Suppose that the observation point is given by the n -th testing day, n t .In this case, ( )

Results
Tables 3-6 summarize the results on AE for the underlying data set DS1 at 50% ~ 90% observation points of the whole data for the prediction length l = 5, 10, 15 and 20 days, where * λ denotes the optimal transformation parameter in the sense of minimum AE, and the bold number implies the best prediction model in the same category.For instance, Table 3 gives the prediction results on the cumulative number of software faults for 5 days prediction at respective observation points, when the number of hidden neurons changes from k = 10 to 50.
In the MIMO type of MLP neural network, we compare the Box-Cox power transformation with the non-transformed case (Normal) and the best SRGM in we expected first that the larger k may lead to the better predictive performance.
However, it is not true from the results in Tables 4-6.In the MIMO-based approach, it is essential to determine feasible k and λ values, because the number of hidden neurons results the expensive computation cost with different prediction length l.In the original paper by Box and Cox [32] they suggest that "fix one, or possibly a small number, of λ's and go ahead with the detailed estimation".In their examples, they use what is usually called "snap to the grid" method to choose the transformation parameter.Unfortunately, no universal method to determine the optimal λ has not been reported yet in the literature.Hence, it is needed to give an appropriate λ even though it is not optimal.From Figure 3 and Figure 4, it can be recognized that the adjustment of λ is quite sensitive to the predictive performance and has to be done through the try-and-error heuristics.However, for an arbitrary λ, we can know that the multi-stages look ahead

Concluding Remarks
In this paper we have investigated an applicability of the Box-Cox power transformation to the neuro-based software fault prediction.The ANN employed in this paper is an MIMO type of MLP, and can handle the grouped data on software fault counts as well as make the long-term prediction.To our best knowledge, this paper is the primary challenge to treat the long-term prediction of software faults with the grouped data in the ANN approach.Throughout a comprehensive comparison with the existing SRGMs, it has been shown that our MIMO type of MLP could work well to predict the cumulative number of software faults in the early testing phase.In the future, we will apply the proposed neural network approach to the other software fault count data and conduct more comprehensive data analysis to validate our method with data transformation.Especially, a challenging issue is to develop the prediction interval of the cumulative number of software faults.Even if SRGMs are assumed, it is almost impossible to get the exact predictive intervals of the cumulative number of software faults without applying any approximation method.We will extend our prediction scheme based on the MIMO type of MLP to the interval prediction problem.We will also consider how to select the optimal transformation parameter in the Box-Cox transformation.
used the Box-Cox power transformation for data driven prediction models with single output neuron in MLP and analyzed the intermittent stream flow for Narmada river basin.They compared the MLP with seasonal autoregressive integrated moving average and Thomas-Fiering models.Sennaroglu and Senvar [35] evaluated the process capability indices in industry and compared the Box and Cox power transformation with weighted variance methods in the Weibull distribution model in terms of the process variation with the product specifications.In this way, an applicability of the Box and Cox power transform can be recognized in many research fields.In this paper we transform the discrete integer-valued data which denote the cumulative number of software faults detected in software system testing to the Gaussian data by the Box-Cox transform.Next we input the transformed data into the MIMO MLP to make the long-term prediction of the software fault count.Note that almost all papers in past, [7] [8] [9] [11] [12] [13] [36] [37] [38] the identification problem of the NHPP is reduced to a statistical estimation problem of the unknown model parameter θ.In this way, when the parametric form of the mean value function or the intensity function is given, the resulting NHPP-based SRGMs are called parametric NHPP-based SRGMs.Table 1 contains the representative NHPP-based SRGMs and their mean value functions.Okamura and Dohi [41] summarized these eleven parametric NHPPbased SRGMs and developed a parameter estimation tool, SRATS, based on the maximum likelihood method and the EM (Expectation-Maximization) algorithm.In SRATS, the best SRGM with the smallest AIC (Akaike Information Criterion) is automatically selected, so the resulting best SRGM can fit best the past observation data on software fault counts among the eleven models.
The maximum likelihood estimate of model parameter θ , can be obtained by maximizing Equation (3) with respect to the model parameter θ.Once the model parameter is estimated, our next concern is to predict the future value of the intensity function or the mean value function at an arbitrary time denotes the prediction length [42].In parametric modeling, the prediction at time n l t + is easily done by substituting estimated model parameter θ into the time evolution ( ) ; t θ Λ , where the unconditional and conditional mean value functions at an arbitrary future time n l t + are given by

,
weights in this paper.As the prediction length is longer, the number of non-estimable weights becomes greater and the prediction uncertainty also increases more.In this scheme, the Box-Cox transformed data ( ) 1 , , n l x x − with given λ are used for the input in the MIMO, and the remainthe teaching signals in the training phase.The BP algorithm is the well-known gradient descent method to update the connection weights, so as to minimize the squared error between the network output values and the teaching signals.For the value coming out an input neuron, is common to add two special inputs; bias units

Figure 1 .
Figure 1.Architecture of back propagation type MIMO.

Figure 2 .
Figure 2. Configuration of prediction scheme via MIMO.
function in the MIMO.Since j h are sum- mative and weighted inputs from respective hidden neurons, the s-th output ( ) the BP algorithm.It updates the weight parameters so as to minimize SSE between the network output values

δ
are the output gradient of j-th hidden neuron and the output gradient in the output layer, and are defined by

,
algorithm, we need to obtain the remaining ( ) k n l nl + − non-estimable weights for prediction.Unfortunately, since these cannot be trained with the information at time n t , we need to give these values by the uniform pseudo random variates in the range [ ] 1,1 − .By giving the random connection weights, the output as the prediction of the cumulative number of software faults, ( ) l − software fault counts data are used for training the MIMO type of MLP.The capability of the prediction model is measured by the average error (is called the relative error for the future time t n s = So we regard the prediction model with smaller AE as a better prediction model.
1, 2, , i n -th testing day.Then, we have the transformed data i x by means of the Box-Cox power transformation.The transformation parameter λ indicates

Table 1 .
In the column of SRGM, we denote the best SRGMs in terms of predictive performance (in the sense of minimum AE) and estimation performance (in the sense of minimum AIC) by P and E, respectively.
It is seen that our MIMO-based approaches provide smaller AEs than the common SRGMs in almost all cases when the observation points are 50% and 70%.In the 60% observation point, the best prediction model is the non-transformed MIMO (Normal) with k = 50.On the other hand, in the latter phase of software testing, i.e., 80% ~ 90% observation points, SRGMs, such as txvmax and txvmin, offer less AEs than the MIMO type of MLPs.Even in these cases, it should be noted the best SRGM with the minimum AIC is not always equivalent to the best SRGM with the minimum AE.This fact tells us that one cannot know exactly the best SRGM in advance in terms of predictive performance.Comparing the data transform methods with the non-transformed one, we can find only one case where Normal provides the best prediction result in Table3for DS1 and Table8for DS2.However, in the other early prediction phases, it is seen that the data transform can work well to give more accurate prediction results in the MIMO type of MLPs.Tables 7-10 present the prediction results on AE for DS2.Similar to DS1, the MIMO type of MLPs can predict the cumulative number of software faults more accurately in the early testing phase, say, 50% ~ 60% observation points, than SRGMs.Focusing on the number of hidden neurons in the MIMO type of MLPs,

Table 3 .
Comparison of average relative errors for five days prediction with DS1 (l = 5).

Table 4 .
Comparison of average relative errors for ten days prediction with DS1 (l = 10).

Table 5 .
Comparison of average relative errors for fifteen days prediction with DS1 (l = 15).

Table 6 .
Comparison of average relative errors for twenty days prediction with DS1 (l = 20).

Table 7 .
Comparison of average relative errors for five days prediction with DS2 (l = 5).

Table 8 .
Comparison of average relative errors for ten days prediction with DS2 (l = 10).

Table 9 .
Comparison of average relative errors for fifteen days prediction with DS2 (l = 15).

Table 10 .
Comparison of average relative errors for twenty days prediction with DS2 (l = 20). of software fault count is possible with the MIMO type of MLP and that the Box-Cox power transformation improves the predictive accuracy, especially, in the early prediction phase. prediction