The bug tracking system is well known as a project support tool for open source software. Many categorical data sets are recorded in the bug tracking system. Many reliability assessment methods have been proposed in the research area of software reliability, and there are several software project analyses based on software effort data, such as earned value management. In particular, software reliability growth models can be applied to the system testing phase of software development. On the other hand, software effort analysis can be applied to all development phases, whereas fault data is recorded only in the testing phase. We focus on the big fault data and effort data of open source software. Such data is difficult to assess with typical statistical assessment methods, because the data recorded in the bug tracking system is large in scale. We also discuss a jump diffusion process model whose jump parameters are estimated by using discriminant analysis. Moreover, we analyze actual big fault data to show numerical examples of software effort assessment considering many categorical data sets.

Open source software (OSS) is used in many areas, including mobile devices, IoT, server-side applications, cloud computing, edge computing, and database software. The development paradigm of OSS differs from the typical software development style. In particular, the maintenance phase follows a fault-data-driven fixing style that relies on the bug tracking system. Big fault data sets are recorded in the OSS bug tracking system, and the size of the fault data often exceeds tens of thousands of lines. However, it is very difficult to assess OSS reliability by using typical statistical methods, because the data recorded in the bug tracking system is very large in scale. As an example, the statistical approach has a problem with the degrees of freedom. Generally, the degrees of freedom are determined by the number of data. Therefore, it is very difficult to decide the degrees of freedom in the case of big data. In this paper, we propose a fusion method that combines statistical and stochastic modeling approaches.

The traditional reliability assessment methods based on software reliability growth models have been proposed by several research groups [

We propose a fusion-method project assessment based on the quantification method of the second type and the jump diffusion process model. Our method can resolve the statistical problem that arises in large-scale data analysis, such as the analysis of big fault data. Moreover, we show several analysis examples based on the proposed method by using actual big fault data.

The organization of this paper is as follows. Section 2 proposes the linear discriminant analysis for big fault data, analyzes several categorical data by using actual fault data, and discusses the jump diffusion process model as the stochastic approach. Section 3 shows the fusion method of the statistical and stochastic modeling approaches by using actual OSS fault data. Section 4 summarizes the characteristics of our method.

It is difficult to estimate the unknown parameters of the jump term in the jump diffusion process, because the model combines different stochastic processes. In particular, the jump diffusion process model consists of two stochastic processes: the Wiener process and the jump process. The model also assumes that the Wiener process is independent of the jump process. Therefore, we can define the parameters of the jump term separately. We propose a method for estimating the jump term parameters by using linear discriminant analysis according to the following procedure.

Step 1: Generally, it is difficult to understand the overall trends of big fault data. Therefore, we apply linear discriminant analysis in order to confirm the correlation between the specified factor and all other factors. Thereby, we can understand the mutual interaction in the big fault data.

Step 2: We focus on the contribution rate obtained from the results of the linear discriminant analysis. The contribution rate is an important measure, because it represents the rate of change of each factor relative to changes in the entire data. Therefore, we consider that estimating the jump term parameters from the contribution rate will be useful for assessing the reliability while considering the characteristics of big fault data.

Step 3: We estimate the mean and variance of the contribution rates for all factors from the analysis results of the big fault data. Then, we use this mean and variance as the unknown parameters of the jump term.

Step 4: We can then show several reliability assessment measures based on the jump diffusion process model.

We focus on Fisher’s linear discriminant analysis. Linear discriminant analysis assumes that the applied data satisfies the following conditions:

1) The data follows a normal distribution.

2) Each class has the same covariance matrix.

3) The variables are independent of each other.
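As an illustration of the technique only, the following minimal sketch computes Fisher’s discriminant direction for two classes with two variables, using synthetic Gaussian data that satisfies the conditions above. The data, the class means, and all function names are assumptions made for this example; they are not taken from the bug tracking data analyzed in this paper.

```python
import random


def mean_vec(rows):
    """Component-wise mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def scatter(rows, m):
    """2x2 scatter matrix: sum of outer products of deviations from m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fisher_direction(class0, class1):
    """Return w = Sw^{-1} (m1 - m0) together with the two class means."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    # Pooled within-class scatter matrix Sw
    sw = [[s0[a][b] + s1[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    return w, m0, m1


def project(w, x):
    """Discriminant score of point x along direction w."""
    return w[0] * x[0] + w[1] * x[1]


# Synthetic data (assumption): two normal classes with equal covariance
rng = random.Random(0)
class0 = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
class1 = [(rng.gauss(3.0, 1.0), rng.gauss(1.0, 1.0)) for _ in range(200)]

w, m0, m1 = fisher_direction(class0, class1)
# Classify by comparing the score with the midpoint of the projected means
threshold = 0.5 * (project(w, m0) + project(w, m1))
correct = (sum(project(w, x) < threshold for x in class0)
           + sum(project(w, x) > threshold for x in class1))
accuracy = correct / (len(class0) + len(class1))
print("direction:", w, "training accuracy:", accuracy)
```

The sketch uses only the standard library so that the linear algebra stays visible; for categorical factors such as those in a bug tracking system, the variables would first have to be encoded numerically, which is outside the scope of this example.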

We show analysis examples by using the Apache HTTP Server Project [

We show the numerical examples based on linear discriminant analysis in Figures 4-13. We discuss the estimation results of Figures 4-13 as follows:

Product: The level of uniformity is high. However, the data has only two factors.

Component: The analysis produces two clusters. In particular, the group containing the core and main components is placed on the left side, and the right-side group consists of small components such as Other and the sub-components.

Version: As with Component, the analysis produces two clusters. The right-side cluster is large. In particular, the group of faults from newer versions is placed in the right-side cluster.

Reporter: The analysis produces four clusters. We can consider this an unbiased result: faults in this software have been reported by a diverse set of reporters.

Severity: Two clusters are estimated. In particular, the right-bottom cluster is large. The clusters of Reporter and Severity may reflect the same situation, because the clusters have the same shape.

Status: We cannot find the characteristics from this figure.

Resolution: There are three types of cluster.

Hardware: There are two types of cluster. In particular, we found that Reporter, Severity, and Hardware show the same tendency.

OS: The specified factor is biased.

Summary: The level of uniformity is high.

In this paper, we analyze the highest contribution rate for each factor, because the contribution rate represents the rate of change of each factor relative to changes in the entire data.

Factor | Maximum value | Minimum value |
---|---|---|
Product | 0.82721 | 0.17279 |
Component | 0.59082 | 0.00110 |
Version | 0.87069 | 0.00088 |
Reporter | 0.52753 | 0.00069 |
Severity | 0.56814 | 0.00366 |
Status | 0.72707 | 0.00088 |
Resolution | 0.63607 | 0.00010 |
Hardware | 0.75804 | 0.00084 |
OS | 0.77949 | 0.00026 |
Summary | 0.20772 | 0.02588 |
Mean | 0.6493 | 0.02071 |
Unbiased SD | 0.19294 | 0.05401 |
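The summary statistics of the contribution rates can be recomputed directly from the per-factor values. The following sketch copies the maximum and minimum contribution rates from the table above and computes their mean and unbiased standard deviation (denominator n - 1). The variable names are chosen for this example only, and the results agree with the reported summary values only up to rounding.

```python
import statistics

# Maximum and minimum contribution rates per factor, copied from the table
max_rates = {
    "Product": 0.82721, "Component": 0.59082, "Version": 0.87069,
    "Reporter": 0.52753, "Severity": 0.56814, "Status": 0.72707,
    "Resolution": 0.63607, "Hardware": 0.75804, "OS": 0.77949,
    "Summary": 0.20772,
}
min_rates = {
    "Product": 0.17279, "Component": 0.00110, "Version": 0.00088,
    "Reporter": 0.00069, "Severity": 0.00366, "Status": 0.00088,
    "Resolution": 0.00010, "Hardware": 0.00084, "OS": 0.00026,
    "Summary": 0.02588,
}

# Mean and unbiased (sample) standard deviation of the maximum rates
mean_max = statistics.mean(list(max_rates.values()))
sd_max = statistics.stdev(list(max_rates.values()))

# The same statistics for the minimum rates
mean_min = statistics.mean(list(min_rates.values()))
sd_min = statistics.stdev(list(min_rates.values()))

print(f"maximum rates: mean={mean_max:.5f}, unbiased SD={sd_max:.5f}")
print(f"minimum rates: mean={mean_min:.5f}, unbiased SD={sd_min:.5f}")
```

`statistics.stdev` is used rather than `statistics.pstdev` because the text refers to the unbiased standard deviation.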

We apply a stochastic differential equation model to manage the maintenance effort in the operational phase of OSS projects. In the past, our research group has proposed the jump diffusion process model [

$$\frac{dJ(t)}{dt} = \{ D(t) + \gamma g(t) \} \{ p - J(t) \}. \tag{1}$$

The parameters of Equation (1) are as follows:

$J(t)$: the cumulative maintenance effort expenditures up to operational time $t$ ($t \ge 0$) in the OSS development project, which takes on continuous real values,

$D(t)$: the increase rate of maintenance effort at operational time $t$, a non-negative function,

$\gamma$: a positive constant representing the magnitude of the irregular fluctuation,

$g(t)$: a standardized Gaussian white noise,

$p$: the estimated amount of maintenance effort required until the end of operation.

We extend Equation (1) to the following stochastic differential equation of Itô type [

$$dJ(t) = \left\{ D(t) - \frac{1}{2}\gamma^{2} \right\} \{ p - J(t) \}\,dt + \gamma \{ p - J(t) \}\,d\omega(t). \tag{2}$$

Here, $\omega(t)$ is defined as follows:

$\omega(t)$: a one-dimensional Wiener process, formally defined as the integral of the white noise $g(t)$ with respect to time $t$.
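To make the behavior of Equation (2) concrete, the following sketch simulates one sample path with the Euler-Maruyama scheme. It assumes a constant increase rate $D(t) = q$ and the initial condition $J(0) = 0$; this choice of $D(t)$, the function name, and all parameter values are assumptions for illustration, not estimates from this paper.

```python
import math
import random


def euler_maruyama(p, q, gamma, t_end, n_steps, rng):
    """Simulate one path of Eq. (2), assuming D(t) = q and J(0) = 0."""
    dt = t_end / n_steps
    j = 0.0
    path = [j]
    for _ in range(n_steps):
        # Wiener increment over one step: normal with mean 0, variance dt
        dw = rng.gauss(0.0, math.sqrt(dt))
        drift = (q - 0.5 * gamma ** 2) * (p - j)
        diffusion = gamma * (p - j)
        j += drift * dt + diffusion * dw
        path.append(j)
    return path


# Hypothetical parameter values, chosen only to produce a readable path
path = euler_maruyama(p=100.0, q=0.1, gamma=0.05, t_end=10.0,
                      n_steps=1000, rng=random.Random(42))
print("J(0) =", path[0], " J(t_end) =", path[-1])
```

With `gamma=0` the scheme reduces to an Euler discretization of the deterministic growth curve $p\{1 - e^{-qt}\}$, which gives a simple sanity check on the step size.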

A jump term can be added to the stochastic differential equation model in order to incorporate irregular states around time t caused by various external factors in the operation phase of the OSS project. Then, the jump-diffusion process [

$$dJ_{j}(t) = \left\{ D(t) - \frac{1}{2}\gamma^{2} \right\} \{ p - J_{j}(t) \}\,dt + \gamma \{ p - J_{j}(t) \}\,d\omega(t) + d\left\{ \sum_{i=1}^{\nu_{t}(\lambda)} (\rho_{i} - 1) \right\}. \tag{3}$$

$\nu_{t}(\lambda)$: a Poisson point process with parameter $\lambda$ at operation time $t$, representing the number of jumps that have occurred, where $\lambda$ is the jump rate. $\nu_{t}(\lambda)$, $\omega(t)$, and $\rho_{i}$ are assumed to be mutually independent.

$\rho_{i}$: the $i$-th jump range.

By using Itô’s formula [

$$J_{je}(t) = p \left[ 1 - \exp\left\{ -qt - \gamma\omega(t) - \sum_{i=1}^{\nu_{t}(\lambda)} \log \rho_{i} \right\} \right], \tag{4}$$

$$J_{js}(t) = p \left[ 1 - (1 + qt) \exp\left\{ -qt - \gamma\omega(t) - \sum_{i=1}^{\nu_{t}(\lambda)} \log \rho_{i} \right\} \right]. \tag{5}$$
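As a consistency check on the continuous part of Equation (4), the jump-free solution can be recovered from Equation (2) by Itô’s formula. The derivation below assumes a constant increase rate $D(t) = q$ and the initial condition $J(0) = 0$; both are assumptions made here, since the choice of $D(t)$ is not stated in the text above.

```latex
\begin{aligned}
X(t) &:= p - J(t), \qquad X(0) = p, \\
dX(t) &= -\left\{ q - \tfrac{1}{2}\gamma^{2} \right\} X(t)\,dt - \gamma X(t)\,d\omega(t), \\
d\log X(t) &= \frac{dX(t)}{X(t)} - \frac{1}{2}\gamma^{2}\,dt = -q\,dt - \gamma\,d\omega(t), \\
X(t) &= p \exp\{ -qt - \gamma\omega(t) \}, \\
J(t) &= p \left[ 1 - \exp\{ -qt - \gamma\omega(t) \} \right].
\end{aligned}
```

Under these assumptions, the last line matches Equation (4) with the jump sum removed, that is, with $\nu_{t}(\lambda) = 0$.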

Considering the effort expenditure phenomenon, we define the jump range by the normal distribution function, which gives a Gaussian jump-diffusion process, in order to reflect the characteristics of software effort-growth phenomena:

$$\rho_{i} \equiv f_{i}(x) = \frac{1}{\sqrt{2\pi}\,\tau} \exp\left[ -\frac{(x - \mu)^{2}}{2\tau^{2}} \right]. \tag{6}$$

We assume that the $i$-th jump range $\rho_{i}$ takes positive values in almost all cases, because the mean value $\mu$ remains sufficiently large. The jump process is independent of the Wiener process in our model. Therefore, we can estimate the parameters of the jump term separately from those of the Wiener term.

In particular, we apply the mean and unbiased standard deviation obtained from the analysis results of the contribution rate in Section 2.1 to the parameters $\mu$ and $\tau$ included in $\rho_{i}$. That is, in the case of Section 2.1, the estimated mean is 0.6493 and the estimated unbiased standard deviation is 0.19294.
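The following sketch shows one way a sample path of Equation (4) could be generated with the jump parameters quoted above. It rests on several assumptions that are not stated in this paper: the jump range $\rho_{i}$ is drawn from a normal distribution with mean $\mu$ and standard deviation $\tau$ and redrawn when non-positive so that $\log \rho_{i}$ is defined, the Poisson arrivals are approximated by at most one jump per small time step, and the values of $p$, $q$, $\gamma$, and $\lambda$ are hypothetical.

```python
import math
import random

MU_HAT = 0.6493    # mean of the contribution rates quoted in the text
TAU_HAT = 0.19294  # unbiased standard deviation quoted in the text


def sample_jump_range(rng, mu=MU_HAT, tau=TAU_HAT):
    """Draw rho_i ~ N(mu, tau^2), redrawing until positive (assumption)."""
    while True:
        rho = rng.gauss(mu, tau)
        if rho > 0.0:
            return rho


def simulate_jje(p, q, gamma, lam, t_end, n_steps, rng):
    """Simulate one path of Eq. (4) on a regular time grid."""
    dt = t_end / n_steps
    omega = 0.0         # running value of the Wiener process
    log_jump_sum = 0.0  # running sum of log(rho_i) over occurred jumps
    path = [0.0]
    for k in range(1, n_steps + 1):
        omega += rng.gauss(0.0, math.sqrt(dt))
        # Approximate Poisson arrivals: one jump per step with prob. lam*dt
        if rng.random() < lam * dt:
            log_jump_sum += math.log(sample_jump_range(rng))
        t = k * dt
        exponent = -q * t - gamma * omega - log_jump_sum
        path.append(p * (1.0 - math.exp(exponent)))
    return path


# Hypothetical values for p, q, gamma, and lam, for illustration only
path = simulate_jje(p=100.0, q=0.1, gamma=0.05, lam=0.2, t_end=10.0,
                    n_steps=1000, rng=random.Random(7))
print("J_je(t_end) =", path[-1])
```

Setting `gamma=0` and `lam=0` removes both noise sources, so the path collapses to the deterministic curve $p\{1 - e^{-qt}\}$; this is a convenient check before varying the jump parameters.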

Based on Section 2.1, we show several numerical examples for the jump diffusion process model.

In particular,

A characteristic of our method is that it can estimate the software effort based on the several fault categories recorded in the bug tracking system. The proposed method can also provide information on the mutual interaction among fault categories through the jump noise. Thereby, OSS managers will be able to assess the stability of an OSS project.

This paper has proposed a reliability assessment method based on the quantification method of the second type and the jump diffusion process model for OSS big fault data. The purposes of the proposed method are as follows:

1) In terms of the quantification method of the second type, it is important to understand the fault categories, because big fault data sets are recorded with many kinds of fault content. It is also helpful to use this fault content, not only the effort data. Here, the contribution rate is a very important measure. A fault category has a large impact if its contribution rate is large, and a small impact if its contribution rate is small. In particular, a factor with a small contribution rate has little effect on the software effort. This means that factors with small contribution rates appear as noise in the software effort.

2) In terms of the jump diffusion process model, we can understand unexpected changes by using the jump term of the model. However, it is difficult to estimate the parameters of the jump term for big fault data because of the complex categorical data. Therefore, we have proposed an estimation method that uses linear discriminant analysis, also known as the quantification method of the second type. Thereby, it is possible to make an assessment that considers the interaction among several fault categories.

For the reasons mentioned above, the proposed method will be useful for assessing the OSS development effort by using the jump noise from the standpoint of the interaction among several fault factors. Moreover, our method can easily be applied to other OSS. The proposed method can find the main factors, as explanatory variables, that affect quality control. Thereby, OSS developers will be able to easily assess the quality based on the conditions recorded in actual big fault data.

This work was supported in part by the JSPS KAKENHI Grant No. 20K11799 in Japan.

The authors declare no conflicts of interest regarding the publication of this paper.

Tamura, Y., Watanabe, H. and Yamada, S. (2020) OSS Project Assessment Based on Discriminant Analysis and Jump Diffusion Process Model for Fault Big Data. American Journal of Operations Research, 10, 269-283. https://doi.org/10.4236/ajor.2020.106015