OSS Project Assessment Based on Discriminant Analysis and Jump Diffusion Process Model for Fault Big Data

The bug tracking system is well known as a project support tool for open source software (OSS). Many categorical data sets are recorded on the bug tracking system. Many reliability assessment methods have been proposed in the research area of software reliability, and several software project analyses are based on software effort data, such as earned value management. In particular, software reliability growth models can be applied to the system testing phase of software development. Software effort analysis, on the other hand, can be applied to all development phases, whereas fault data is recorded only in the testing phase. We focus on the big fault data and effort data of open source software. Because the data recorded on the bug tracking system is large scale, it is difficult to assess it by using typical statistical assessment methods. We therefore discuss a jump diffusion process model whose jump parameters are estimated by using discriminant analysis. Moreover, we analyze actual big fault data to show numerical examples of software effort assessment considering many categorical data sets.


Introduction
Open source software (OSS) is used in many areas, including mobile devices, IoT, server-side applications, cloud computing, edge computing, and database software. The development paradigm of OSS differs from the typical software development style. In particular, the maintenance phase follows a fault-data-driven fixing style that relies on the bug tracking system. Big fault data sets are recorded on the OSS bug tracking system, and the fault data often exceeds tens of thousands of lines. Because the data recorded on the bug tracking system is so large, it is very difficult to assess OSS reliability by using typical statistical methods. One example is the problem of the degrees of freedom in the statistical approach. Generally, the degrees of freedom are determined by the number of data points. Therefore, it is very difficult to decide the degrees of freedom in the case of big data. In this paper, we propose a fusion method of statistical and stochastic modeling approaches.
Traditional reliability assessment methods based on software reliability growth models have been proposed by several research groups [1] [2]. Moreover, several research papers on OSS reliability assessment have been published [3] [4] [5]. On the other hand, few statistical methods for OSS reliability assessment have been proposed [6] [7], because it is difficult to assess OSS reliability by statistical analysis.
We propose a fusion-method project assessment based on the quantification method of the second type and the jump diffusion process model. With our method, we can resolve the statistical problem that arises in large-scale data analysis such as that of big fault data. Moreover, we show several analysis examples based on the proposed method by using actual big fault data.
The organization of this paper is as follows. Section 2 proposes the linear discriminant analysis for big fault data, analyzes several categorical data by using actual fault data, and discusses the jump diffusion process model as the stochastic approach. Section 3 shows the fusion method of the statistical and stochastic modeling approaches by using actual OSS fault data. Section 4 summarizes the characteristics of our method.

Statistical and Stochastic Modeling Approaches
It is difficult to estimate the unknown parameters of the jump term in the jump diffusion process, because the jump diffusion process model combines different stochastic processes. In particular, the model consists of two stochastic processes: the Wiener process and the jump process. The model also assumes that the Wiener process is independent of the jump process. Therefore, we can define the parameters of the jump term separately. We propose an estimation method for the jump term parameters by using linear discriminant analysis, according to the following procedure.
Step 1: Generally, it is difficult to understand the overall trends of big fault data. Therefore, we apply linear discriminant analysis in order to confirm the correlation between a specified factor and all factors. Thereby, we can understand the mutual interaction in the big fault data.
Step 2: We focus on the contribution rate obtained from the results of the linear discriminant analysis. The contribution rate is an important measure, because it represents the rate of change of each factor relative to changes in the entire data. Therefore, we consider that estimating the jump term parameters from the contribution rate will be useful for assessing reliability while taking the characteristics of big fault data into account.
Step 3: We estimate the mean and variance of the contribution rates for all factors from the analysis results of the big fault data. We then use the mean and variance of the contribution rates as the unknown parameters of the jump term.
Step 4: Finally, we can derive several reliability assessment measures based on the jump diffusion process model.
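Steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `jump_parameters` is hypothetical, and the contribution rates below are made-up placeholder values rather than the values of Table 1.

```python
import statistics


def jump_parameters(contribution_rates):
    """Return (mean, unbiased standard deviation) of the contribution rates.

    The two values are intended to serve as the jump-term parameters
    described in Step 3.
    """
    if len(contribution_rates) < 2:
        raise ValueError("at least two contribution rates are required")
    mean = statistics.mean(contribution_rates)
    # statistics.stdev divides by n - 1, i.e. it is the unbiased estimator.
    std = statistics.stdev(contribution_rates)
    return mean, std


# Hypothetical largest contribution rate per factor (NOT the Table 1 values).
rates = {"Reporter": 0.25, "Severity": 0.15, "Status": 0.20, "Resolution": 0.20}
jump_mean, jump_std = jump_parameters(list(rates.values()))
print(jump_mean, jump_std)
```

The standard library `statistics` module is used so that the sketch has no external dependencies; the variance mentioned in Step 3 is simply the square of the returned standard deviation.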

Linear Discriminant Analysis
We focus on Fisher's linear discriminant analysis. Linear discriminant analysis assumes that the data to which it is applied satisfies the following conditions:
1) The data follows the normal distribution.
2) Each class has the same covariance matrix.
3) The variables are independent of each other.
The right-side cluster becomes large. In particular, the fault group of the newer version is placed in the right-side cluster.
Reporter: Four clusters are formed by the analysis. We can consider this an unbiased result. Faults in this software have been reported by a wide variety of reporters.
Severity: Two clusters are estimated. In particular, the bottom-right cluster becomes large. The clusters of Reporter and Severity may reflect the same situation, because two clusters have the same shape.
Status: We cannot find any characteristics from this figure.
Resolution: There are three types of cluster.
Hardware: There are two types of cluster. In particular, we found that Reporter, Severity, and Hardware show the same tendency.
OS: The specified factor is biased.
Summary: The level of uniformity is high.
In this paper, we analyze the highest contribution rate for all factors, because the contribution rate represents the rate of change of each factor relative to changes in the entire data. Table 1 shows the estimated largest contribution rate for each factor. From Table 1, we found that the mean is 0.19294 and the unbiased standard deviation is 0.05401 in the case of the minimum value. In this paper, we define the jump term of the jump diffusion process model by using the estimated mean and unbiased standard deviation obtained from the contribution rates. We consider that the degree of influence on the number of faults becomes large in the case of the maximum value of the contribution rates. On the other hand, the degree of influence on the number of faults becomes small in the case of the minimum value of the contribution rates, i.e., it is appropriate to assess it by using the jump term of the jump diffusion process model.
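For reference, the mean and unbiased standard deviation quoted from Table 1 presumably follow the usual sample definitions. The symbols below are notation assumed here for illustration and do not appear in the original: c_1, ..., c_n denote the largest contribution rates of the n factors, \bar{c} their mean, and s their unbiased standard deviation.

```latex
\bar{c} = \frac{1}{n}\sum_{k=1}^{n} c_k, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}\left(c_k - \bar{c}\right)^{2}}
```

With the values reported from Table 1, \bar{c} = 0.19294 and s = 0.05401.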

Jump Diffusion Process Model
We apply a stochastic differential equation model to manage the maintenance effort in the operational phase of OSS projects. Our research group has previously proposed the jump diffusion process model [9] [10]. First, we discuss the flexible jump diffusion process model. The jump diffusion process model is derived from the following stochastic differential equation with Brownian motion [11] [12]:
The parameters of Equation (1)
We extend it to the following stochastic differential equation of Itô type [11]:
Then, a jump term can be added to the stochastic differential equation model in order to incorporate the irregular state around time t caused by various external factors in the operation phase of the OSS project. The jump-diffusion process [9] [10] [13] is then given as
By using Itô's formula [11] [12], the solution of the former equation can be obtained as follows:
On the other hand, Figure 16 and Figure 17 show the estimated sample path of the cumulative software effort and the estimated sample path of the required effort expense for the S-shaped type model. From Figures 14-17, the S-shaped type model gives an optimistic estimate in comparison with the exponential type model, because the estimated sample path of the required effort expense for the S-shaped type model is smaller than that for the exponential type model, as shown in Figure 15 and Figure 17.
In particular, Figure 18 shows the estimated distribution function f_i(x) in Equation (6) based on the contribution rates. Figure 18 plays an important role in the proposed method, because the jump parameter is estimated by using linear discriminant analysis in order to summarize the interaction among the complex categories recorded in the big fault data.
A characteristic of our method is that it can estimate the software effort based on several fault categories recorded on the bug tracking system. The proposed method can also provide information on the mutual interaction among several fault categories by using the jump noise. Thereby, OSS managers will be able to assess the stability of an OSS project.
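The role of the jump term in this section can be illustrated with a small simulation. The sketch below is only an assumption-laden illustration, not the model equations of this paper: it assumes an exponential-type path X(t) = alpha * (1 - exp(-(beta t + sigma W(t) + J(t)))), where W(t) is a Wiener process and J(t) is a compound Poisson sum of normally distributed jump sizes. The function name `simulate_effort_path` and the values of `alpha`, `beta`, `sigma`, and `jump_rate` are hypothetical; only the jump mean 0.19294 and standard deviation 0.05401 are taken from the text.

```python
import math
import random


def simulate_effort_path(alpha, beta, sigma, jump_rate, jump_mean, jump_std,
                         horizon, steps, seed=0):
    """Simulate one sample path of an ASSUMED exponential-type jump diffusion.

    X(t) = alpha * (1 - exp(-(beta*t + sigma*W(t) + J(t))))
    W(t): Wiener process, J(t): compound Poisson sum of N(jump_mean, jump_std^2)
    jump sizes. Requires jump_rate * (horizon / steps) to be well below 1.
    """
    rng = random.Random(seed)
    dt = horizon / steps
    wiener = 0.0
    jumps = 0.0
    path = [0.0]
    for k in range(1, steps + 1):
        # Wiener increment over one time step: N(0, dt).
        wiener += rng.gauss(0.0, math.sqrt(dt))
        # Bernoulli approximation of a Poisson arrival within a short step.
        if rng.random() < jump_rate * dt:
            jumps += rng.gauss(jump_mean, jump_std)
        t = k * dt
        exponent = beta * t + sigma * wiener + jumps
        path.append(alpha * (1.0 - math.exp(-exponent)))
    return path


# Jump mean and standard deviation from the text; all other values hypothetical.
path = simulate_effort_path(alpha=1000.0, beta=0.05, sigma=0.02, jump_rate=0.1,
                            jump_mean=0.19294, jump_std=0.05401,
                            horizon=100.0, steps=1000, seed=1)
print(len(path), path[-1])
```

Setting `sigma` and `jump_rate` to zero reduces the path to the deterministic curve alpha * (1 - exp(-beta t)), which makes the separate contributions of the Wiener term and the jump term easy to compare.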

Conclusions
This paper has proposed a reliability assessment method based on the quantification method of the second type and the jump diffusion process model for OSS big fault data. The purposes of the proposed method are as follows:
1) In terms of the quantification method of the second type, it is important to understand several fault categories, because fault big data sets are recorded with many kinds of fault content. It is also helpful to use this fault content in addition to the effort data. Here the contribution rate is a very important measure. A fault category has a large impact if its contribution rate is large, and a small impact if its contribution rate is small. In particular, a factor with a small contribution rate has little effect on the software effort. This means that factors with small contribution rates appear as noise in the software effort.
2) In terms of the jump diffusion process model, we can understand unexpected changes by using the jump term of the model. However, it is difficult to estimate the parameters of the jump term for fault big data because of the complex category data. Therefore, we have proposed an estimation method that uses linear discriminant analysis, also known as the quantification method of the second type. Thereby, it is possible to make an assessment that considers the interaction among several fault categories.
For the above-mentioned reasons, the proposed method will be useful for assessing the OSS development effort by using the jump noise from the standpoint of the interaction among several fault factors. Our method can also easily be applied to other OSS. The proposed method can find the main factors, as explanatory variables, that affect quality control. Thereby, OSS developers will be able to easily assess quality from the standpoint of the conditions recorded in actual fault big data.