A Software Reliability Model for OSS Including Various Fault Data Based on Proportional Hazard-Rate Model

A software reliability model is a stochastic model that measures software reliability quantitatively. The Hazard-Rate Model is well known as a typical software reliability model. We propose Hazard-Rate Models Considering Fault Severity Levels (CFSL) for Open Source Software (OSS). The purpose of this research is to adapt the Hazard-Rate Model CFSL to a baseline hazard function and two kinds of fault data in the Bug Tracking System (BTS); i.e., we use the covariate vectors in the Cox proportional Hazard-Rate Model. We also show numerical examples that evaluate the performance of our proposed model, and we compare its performance with that of the Hazard-Rate Model CFSL.


Introduction
Open Source Software (OSS) is used by many organizations in various situations because of its low cost, standardization, and quick delivery. However, the quality of OSS is not ensured, because OSS is developed by many volunteers around the world in a unique development style that has no organized testing phase. The faults latent in OSS are usually fixed by using the database of the Bug Tracking System (BTS), which holds various information related to faults. Given the current problems of OSS and its expected future demand, the reliability assessment of OSS is both necessary and important.
A software reliability model is a mathematical model that measures software reliability by statistical and stochastic approaches. To date, many models have been proposed, not only for proprietary software but also for OSS [1]- [6]. The Hazard-Rate Model is well known as a typical software reliability model [7] [8] [9] [10]. We previously proposed a Hazard-Rate Model Considering Fault Severity Levels (CFSL) for OSS [11]. Most Hazard-Rate Models measure software reliability using only the times at which software failures occur in the testing or operation phase. However, various other information related to software faults is available aside from the failure-occurrence times. In previous research, a Hazard-Rate Model has included data on the failure identification work and CPU execution time, which are called environment data in that paper. Related models have also been proposed in the past by using

Bug Tracking System
BTS is a database to which OSS users can report information about faults in OSS. BTS holds various information, e.g., the time at which a fault was recorded, the time at which a fault was fixed, the nickname of the fault assignee, and so on. Table 1 lists the fault data in BTS.

Hazard-Rate Model
First, we show the stochastic quantities related to the number of software faults and the time of occurrence of software failures in the testing or operation phase, as shown in Figure 1.
The distribution function of $X_k$ $(k = 1, 2, \ldots)$, representing the time interval between the $(k-1)$-st and $k$-th detected faults, is defined as:
$$F_k(x) \equiv \Pr\{X_k \le x\} \quad (x \ge 0), \tag{1}$$
where $\Pr\{A\}$ represents the occurrence probability of event $A$. Therefore, the following derived function is the probability density function of $X_k$:
$$f_k(x) \equiv \frac{dF_k(x)}{dx}. \tag{2}$$

Table 1. Fault data in BTS.
Changed: The modified date and time.
Product: The name of the product included in OSS.
Component: The name of the component included in OSS.
Version: The version number of OSS.
Reporter: The nickname of the fault reporter.
Assignee: The nickname of the fault assignee.
Severity: The level of the fault.
Status: The fixing status of the fault.
Resolution: The status of resolution of the fault.
Hardware: The name of the hardware under fault occurrence.
OS: The name of the operating system under fault occurrence.
Summary: The brief contents of the fault.

Also, the software reliability can be defined as the probability that a software failure does not occur during the time interval $(0, x]$. The software reliability is given by:
$$R_k(x) \equiv \Pr\{X_k > x\} = 1 - F_k(x). \tag{3}$$
From Equations (1)-(3), the hazard rate is given by the following equation:
$$z_k(x) \equiv \frac{f_k(x)}{1 - F_k(x)} = \frac{f_k(x)}{R_k(x)}, \tag{4}$$
where the hazard rate means the software failure rate at time $x$ given that no software failure has occurred during the time interval $(0, x]$. A Hazard-Rate Model is a software reliability model representing the software failure-occurrence phenomenon by the hazard rate. Moreover, we discuss three Hazard-Rate Models as follows.
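As a brief worked step, offered as a sketch rather than as part of the original derivation: writing $F_k$, $f_k$, $R_k$, and $z_k$ for the distribution function, density, reliability, and hazard rate of $X_k$, and assuming $F_k(0) = 0$, the definition of the hazard rate can be integrated to express the reliability through the hazard rate. The integration variable $u$ is introduced here only for illustration.

```latex
z_k(x) = \frac{f_k(x)}{R_k(x)} = -\frac{d}{dx}\ln R_k(x)
\quad\Longrightarrow\quad
R_k(x) = \exp\!\left(-\int_0^x z_k(u)\,du\right), \qquad R_k(0) = 1 .
```

In particular, when the hazard rate does not depend on $x$, $X_k$ is exponentially distributed with mean $1/z_k$.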

Jelinski-Moranda Model
The Jelinski-Moranda (J-M) model is one of the Hazard-Rate Models. The J-M model has the following assumptions:
1) The software failure rate during a failure interval is constant and is proportional to the number of faults remaining in the software;
2) The number of remaining faults in the software decreases by one each time a software failure occurs;
3) Any fault that remains in the software has the same probability of causing a software failure at any time.
From the above assumptions, the software hazard rate in Equation (4) at the $k$-th failure can be derived as:
$$z_k(x) = \phi\,(N - k + 1) \quad (N > 0,\ \phi > 0;\ k = 1, 2, \ldots, N), \tag{5}$$
where each parameter is defined as follows: $N$: the number of latent software faults before the testing; $\phi$: the hazard rate per inherent fault.

Moranda Model
The Moranda model has the following assumption: the software failure rate per software fault is constant and decreases geometrically each time a fault is discovered.
From the above assumption, the software hazard rate in Equation (4) at the $k$-th failure can be derived as:
$$z_k(x) = D\,c^{\,k-1} \quad (D > 0,\ 0 < c < 1;\ k = 1, 2, \ldots), \tag{6}$$
where each parameter is defined as follows: $D$: the initial hazard rate for the software failure; $c$: the decrease coefficient for the hazard rate.

Xie Model
The Xie model has the following assumption: the software failure rate per software fault is constant and decreases exponentially with the number of faults remaining in the software.
From the above assumption, the software hazard rate in Equation (4) at the $k$-th failure can be derived as:
$$z_k(x) = \phi\,(N - k + 1)^{m} \quad (N > 0,\ \phi > 0,\ m \ge 1;\ k = 1, 2, \ldots, N), \tag{7}$$
where each parameter is defined as follows: $N$: the number of latent software faults before the testing; $\phi$: the hazard rate per inherent fault; $m$: the constant parameter.
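The three hazard rates can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the forms $\phi(N-k+1)$ for the J-M model, $Dc^{k-1}$ for the Moranda model, and the power form $\phi(N-k+1)^m$ for the Xie model (the symbol `m` and the power form follow the common statement of that model and are an assumption here). Because each hazard rate is constant in $x$ for a fixed $k$, the mean time between failures is taken as $1/z_k$. All function names and parameter values are hypothetical.

```python
def jelinski_moranda_hazard(k: int, n_faults: float, phi: float) -> float:
    """J-M hazard rate before the k-th failure: phi * (N - k + 1)."""
    if not 1 <= k <= n_faults:
        raise ValueError("k must satisfy 1 <= k <= N")
    return phi * (n_faults - k + 1)


def moranda_hazard(k: int, d: float, c: float) -> float:
    """Moranda hazard rate before the k-th failure: D * c**(k - 1)."""
    if k < 1 or not 0.0 < c < 1.0:
        raise ValueError("require k >= 1 and 0 < c < 1")
    return d * c ** (k - 1)


def xie_hazard(k: int, n_faults: float, phi: float, m: float) -> float:
    """Xie hazard rate (assumed power form): phi * (N - k + 1)**m."""
    if not 1 <= k <= n_faults:
        raise ValueError("k must satisfy 1 <= k <= N")
    return phi * (n_faults - k + 1) ** m


def mtbf(hazard: float) -> float:
    """Mean time between failures for a hazard rate that is constant in x."""
    if hazard <= 0.0:
        raise ValueError("hazard rate must be positive")
    return 1.0 / hazard


# Hypothetical parameter values, chosen only to show the call pattern.
z_jm = jelinski_moranda_hazard(k=1, n_faults=10, phi=0.5)
print(z_jm, mtbf(z_jm))
```

In every one of these models the hazard rate falls as `k` grows, so the MTBF lengthens as faults are removed.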

Mean Time between Failures (MTBF)
The three Hazard-Rate Models above share the following assumption: any fault that remains in the software has the same probability of causing a software failure at any time. We assume that the fault data is divided into the following types in terms of

Hazard-Rate Model Considering Fault Severity Levels (CFSL)
where each parameter is defined as follows:

Cox Proportional Hazard-Rate Model
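The Cox proportional Hazard-Rate Model (PHM) represents the hazard rate by using a baseline hazard function, which is a function of time, and a covariate vector. In this section, we discuss the Cox PHM. It is assumed that two kinds of vectors are defined as follows:
$$\boldsymbol{\alpha}_k = (\alpha_{k1}, \alpha_{k2}, \ldots, \alpha_{kq})^{\mathrm{T}}, \qquad \boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_q)^{\mathrm{T}},$$
where each vector is defined as follows: $\boldsymbol{\alpha}_k$: the covariate vector including $q$ kinds of data; $\boldsymbol{\beta}$: the coefficient parameter for $\boldsymbol{\alpha}_k$. Therefore, the Cox PHM is defined as follows by using the two vectors above:
$$z(x_k \mid \boldsymbol{\alpha}_k) = z_0(x_k)\,\exp\!\left(\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{\alpha}_k\right), \tag{14}$$
where $z_0(x_k)$ is called the baseline hazard function and is a function of $x_k$.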
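The structure of the Cox PHM can be illustrated with a short Python sketch. It assumes the standard form in which the baseline hazard is multiplied by the exponential of the inner product of the coefficient vector and the covariate vector. The function name `cox_hazard` and the numbers used are hypothetical.

```python
import math
from typing import Sequence


def cox_hazard(baseline_hazard: float,
               beta: Sequence[float],
               alpha_k: Sequence[float]) -> float:
    """Hazard rate z_0 * exp(beta . alpha_k) for one failure interval."""
    if len(beta) != len(alpha_k):
        raise ValueError("beta and alpha_k must have the same length q")
    linear_predictor = sum(b * a for b, a in zip(beta, alpha_k))
    return baseline_hazard * math.exp(linear_predictor)


# Hypothetical example with q = 2 covariates.
print(cox_hazard(0.2, beta=[0.5, -0.1], alpha_k=[1.0, 3.0]))
```

Under this form, a one-unit increase in a covariate multiplies the hazard rate by `exp` of the corresponding coefficient, whatever the baseline value is. A coefficient of zero leaves the baseline unchanged.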

Proposed Model
In this paper, we apply the exponential Hazard-Rate Model to the baseline hazard function. Thus, the proposed model can be regarded as a parametric model. Moreover, the distribution function and the density function of $X_k$ are derived as Equations (16) and (17), respectively.
For this reason, the parameters in the proposed model can be estimated by MLE (Maximum Likelihood Estimation).
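To make the estimation step concrete, the following Python sketch fits a simplified parametric PHM by maximum likelihood. It is an illustration under stated assumptions, not the paper's estimator: the baseline hazard is taken to be a constant `lam`, there is a single covariate, and each interval $X_k$ is therefore exponential with rate `lam * exp(beta * a_k)`. For a fixed `beta`, the maximizing `lam` has a closed form, so the sketch searches over `beta` alone with a ternary search, which is valid because the profile log-likelihood is concave in `beta`. All names and the toy data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def profile_lambda(beta: float,
                   intervals: Sequence[float],
                   covariates: Sequence[float]) -> float:
    """Closed-form MLE of the constant baseline rate for a fixed beta."""
    weighted_time = sum(x * math.exp(beta * a)
                        for x, a in zip(intervals, covariates))
    return len(intervals) / weighted_time


def log_likelihood(lam: float, beta: float,
                   intervals: Sequence[float],
                   covariates: Sequence[float]) -> float:
    """Log-likelihood of X_k ~ Exponential(rate = lam * exp(beta * a_k))."""
    return sum(math.log(lam) + beta * a - lam * math.exp(beta * a) * x
               for x, a in zip(intervals, covariates))


def fit_exponential_phm(intervals: Sequence[float],
                        covariates: Sequence[float],
                        beta_bounds: Tuple[float, float] = (-5.0, 5.0),
                        tol: float = 1e-7) -> Tuple[float, float]:
    """Return (lam_hat, beta_hat) by ternary search on the profile likelihood."""
    def profile(beta: float) -> float:
        lam = profile_lambda(beta, intervals, covariates)
        return log_likelihood(lam, beta, intervals, covariates)

    low, high = beta_bounds
    while high - low > tol:
        left = low + (high - low) / 3.0
        right = high - (high - low) / 3.0
        if profile(left) < profile(right):
            low = left    # the maximum lies to the right of `left`
        else:
            high = right  # the maximum lies to the left of `right`
    beta_hat = (low + high) / 2.0
    return profile_lambda(beta_hat, intervals, covariates), beta_hat


# Toy data (hypothetical): two intervals with covariate 0, two with covariate 1.
lam_hat, beta_hat = fit_exponential_phm([2.0, 4.0, 1.0, 1.0],
                                        [0.0, 0.0, 1.0, 1.0])
print(lam_hat, beta_hat)
```

For this toy data the estimates can be checked by hand: the group with covariate 0 has rate 2/6 and the group with covariate 1 has rate 2/2, so the fit should approach `lam = 1/3` and `beta = ln 3`. With several covariates, a general-purpose numerical optimizer would replace the one-dimensional search.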

Numerical Example
We use fault big data from the Apache HTTP Server to estimate the MTBF and evaluate the performance of our proposed model compared to the Hazard-Rate Model CFSL [14]. The assignee data is converted into numerical values in the form of frequency of occurrence. Specifically, our proposed model is divided into three cases as follows: PHM1: only the assignee data is included in $\boldsymbol{\alpha}_k$; PHM2: only MTBC is included in $\boldsymbol{\alpha}_k$; PHM3: both the assignee data and MTBC are included in $\boldsymbol{\alpha}_k$.
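One way to turn assignee nicknames into numbers is sketched below. It assumes that "frequency of occurrence" means the number of fault records in the data set that carry the same assignee. The paper does not spell out the exact encoding, so this reading, the function name, and the sample nicknames are all hypothetical.

```python
from collections import Counter
from typing import List


def frequency_encode(assignees: List[str]) -> List[int]:
    """Replace each nickname by the number of records sharing that nickname."""
    counts = Counter(assignees)
    return [counts[name] for name in assignees]


# Hypothetical nicknames, one per fault record in reporting order.
records = ["alice", "bob", "alice", "carol", "alice"]
print(frequency_encode(records))
```

A relative frequency (count divided by the number of records) would serve equally well as a covariate; it only rescales the corresponding coefficient.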
The parameters in the proposed models are estimated by MLE. The estimated parameter values for the three models are shown in Table 2.
Based on Table 2, there is no multicollinearity in PHM3. As a criterion to measure the goodness-of-fit of our proposed model, we use AIC (Akaike's Information Criterion), which is based on the maximum likelihood estimation of model parameters. Figures 2-5 show the estimated MTBF for each model, and Table 3 compares the models. According to Table 3, the PHM can predict the MTBF of OSS more accurately.
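For reference, AIC is computed from the maximized log-likelihood and the number of estimated parameters as $-2 \ln L + 2p$, and the model with the smaller value is preferred. The sketch below shows the calculation. The log-likelihood values and parameter counts are made-up placeholders, not results from this paper.

```python
def aic(max_log_likelihood: float, n_params: int) -> float:
    """Akaike's Information Criterion: -2 * ln L + 2 * (number of parameters)."""
    return -2.0 * max_log_likelihood + 2.0 * n_params


# Placeholder values for two hypothetical candidate models.
candidates = {"model_a": aic(-100.0, 3), "model_b": aic(-98.5, 5)}
best = min(candidates, key=candidates.get)
print(candidates, best)
```

The penalty term means that a model with more covariates is preferred only if it raises the log-likelihood by more than one unit per added parameter.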

Conclusions
OSS is popular and in demand among many organizations in various situations. However, OSS is developed by many volunteers around the world without an explicit testing phase. Therefore, the reliability of OSS is not ensured. For this reason, it is necessary to measure software reliability quantitatively. There are various fault data in the BTS of OSS, and these data sets are useful for finding the characteristics of OSS. Moreover, we can assess software reliability accurately by using not only the times at which software failures occur in the testing or operation phase but also the other fault data in BTS.
In BTS, there are many kinds of fault data aside from those we used in this paper. Therefore, as future research, we will consider other software reliability models that use other kinds of fault data in BTS. We would also like to suggest new measures of OSS reliability that reflect the characteristics of OSS.