Quasi-Negative Binomial: Properties, Parametric Estimation, Regression Model and Application to RNA-SEQ Data

Background: The Poisson and the Negative Binomial distributions are com-monly used to model count data. The Poisson is characterized by the equality of mean and variance whereas the Negative Binomial has a variance larger than the mean and therefore both models are appropriate to model over-dispersed count data. Objectives: A new two-parameter probability distribution called the Quasi-Negative Binomial Distribution (QNBD) is being studied in this paper, generalizing the well-known negative binomial distribution. This model turns out to be quite flexible for analyzing count data. Our main objectives are to estimate the parameters of the proposed distribution and to discuss its applicability to genetics data. As an application, we demonstrate that the QNBD regression representation is utilized to model genomics data sets. Results: The new distribution is shown to provide a good fit with respect to the “Akaike Information Criterion”, AIC, considered a measure of model goodness of fit. The proposed distribution may serve as a viable alternative to other distributions available in the literature for modeling count data exhibiting overdispersion, arising in various fields of scientific investigation such as genomics and biomedicine.


Introduction
A random variable X is said to have "Quasi Negative Binomial Distribution", QNBD if the probability function is given by: The distribution whose probability function is given in (1) was first derived by Takács [1] as a queuing model. He assumed that we have a single server queue with independent customers arriving according to a Poisson process in batches of size ( ) 1 β − with traffic intensity π and exponential service time with mean 1/α. It is also assumed that the service time is independent of the interarrival time. Under these conditions, the probabilities of arrival ( ) θ π π α = + and departure = 1 − θ. Takács [1] and later Consul and Gupta [2] showed that the probability that a buy period has ( ) 1 x β − is for fixed β given by (1). The distribution is a member of the Lagrange class of distributions [3] [4] [5].
The shape of the histogram of X depends on the combination (β, θ). In Figure   1 & Figure 2 we can see that the distribution has a much longer tail for large values of β.
The paper is structured as follows: In Section 2 we demonstrate the connection between the QNBD and the regular exponential family of distributions [6] and derive the higher order central moments of the distribution. A limiting form of the distribution will be investigated as well. In Section 3, we derive the first order approximation of the variances and biases of the moment estimators of (β, θ). In Section 4, we derive the maximum likelihood estimators and their asymptotic variances and biases. In Section 5, we develop the regression model and establish discuss the maximum likelihood estimation of the regression parameters. In Section 6, we apply the models to real-life data arising from genomic studies. We provide general discussion in Section 7.

Moments of the Distribution
The simplest approach to derive the higher order central moments of the distribution is to first write (1) in the general form of the linear exponential family.
For fixed β, the QNBD belongs to the regular exponential family of discrete random variables:  µ′ and variance 2 µ of X are given respectively by:

Moment Estimators
Suppose that we have a random sample 1 Note that, the information matrix is the determinant of the variance covariance matrix of the moment estimators and is given by: Example: Modeling the number of brain lesions to predict Multiple Sclerosis.
The use of gadolinium (Gd) withT1 weighted imaging can identify areas of breakdown in the blood-brain barrier and increases the reliability and in detecting active Multiple Sclerosis (MS) lesions [8]. The number of new Gd enhancing lesions is a widely used end point for monitoring disease activity and for evaluating the effect of treatments in phase II clinical trials. In these studies, the results of the Magnetic Resonance Imaging (MRI) end point are in the form of counts [9]. To deal with the problem of overdispersion, the negative binomial distribution is used to model this type of data. As application of the QNBD we simulated lesions count data like the situation described in [8] (Table 1). The sample size = 116 subjects.
The histogram of the data s is given in Figure 3. The y-axis we have the frequency of each x.  Table 1. The x-axis we have the x values. From Figure 5 we may infer that the distribution of β seems to be a mixture of two distribution or is bimodal. From these results, we may conclude that the moment estimators are not reliable unless we have extremely large sample.
In the next section, we discuss the maximum likelihood estimation.

Maximum Likelihood Estimators (MLE)
It is well-known that the estimators obtained from application of the method of MLE possess optimal properties such asymptotic normality and efficiency. Based on a simple random sample the log-likelihood (l) function is given by: Similarly, setting l β ∂ ∂ equal to zero and solving for β we get: The MLEs of θ and β are thus obtained by solving (11) and (12) iteratively, noting that (12) is in the form of ( ) Elements of the variance-covariance matrix of the ( ) , θ β are obtained by inverting the Fisher's information matrix. We can show that We note that on using the digamma approximation we can write The R-Code for fitting the QNBD is given in Appendix 1.

Orthogonal Polynomial Approximation for i ββ
The evaluation of the asymptotic variance covariance matrix is difficult because does not have a tractable form. To overcome this difficulty, following [5] we employ an asymptotic expansion for 2 log x P β ∂ ∂ as a linear combination of orthogonal polynomials. From Morgan et al. [9], if x P is a distribution function with feint moments r µ of all orders, then the point 0 If the distribution function P has atleast Y points of increase, Cramér [10] has proved that there exists a sequence of polynomials where r a are selected so that: Direct calculations give: The asymptotic relative efficiency of the moment estimators is therefore given by: For the lesion data Eff = 16.6%. We interpret this number as follows: for the moment estimators to be as efficient as the maximum likelihood estimators, we need a sample size that is 16.6% larger compared to the sample size used for the maximum likelihood estimation.

Asymptotic biases of the MLE
Unlike the moment estimators, the ( ) , θ β do not have closed form expressions, and the applications of the delta method cannot be used to obtain their asymptotic biases. Sherton and Wallington [12] used an approach that depends on the asymptotic expansion of the log-likelihood functions. We denote the biases of θ , and β by ( ) In the above system of equation we have the following notations: We can show that  2  2  1  3  2  3  3  3 4  2  5  2  3  2  2  2 4  3  2 Finally, using the above information we can show that 2,22 0 P = .
Solving the system of equations, we obtain the asymptotic biases so that

Quasi Negative Binomial Regression
Our aim in this section is develop regression model based on the GNBD. The approach is facilitated by the fact that the QNBD is a member of the regular exponential family shown in [13]. We employ the transformation: Here we assume to ( ) i τ θ be monotone, differentiable, and positive function of θ [13]. In (16) z is a vector of ( ) In this section, we derive the maximum likelihood estimators of the regression parameters, the parameter β and their asymptotic properties. The log-likelihood function is given by:       The simplest approach to obtain the maximum likelihood estimators of γ and β is by solving the equations; var 1 x µ µ = + . Expressing the distribution in terms of the mean parameter μ, the limiting distribution can be written as: In a paper that follows, we shall discuss the issues of maximum likelihood estimation for the parameter μ of the probability function (18)

Data Analysis: RNA_SEQ Data: Modeling the Distribution of Read Counts
Over the past decade, various statistical analysis tools have been developed to analyze expression profiling data generated by microarrays (Reviewed in [15] [16] [17]). Before these tools can be applied to RNA-Seq data, it is worth noting that microarray data and RNA-Seq data are inherently different [16]. Microarray data is "analog" since expression levels are represented as continuous hybridization signal intensities. In contrast, RNA-Seq data is "digital", representing expression levels as discrete counts. This inherent difference leads to the difference in the parametric statistical methods that are used since they often depend on the assumptions of the random mechanism that generates the data. The Poisson, Binomial and Negative binomial distributions are more suitable for modeling discrete data in an RNA-Seq experiment. Therefore, a statistical method developed for microarray data analysis cannot be directly applied to RNA-Seq data analysis without first examining the underlying distributions. Recently several statistical methods have been developed to deal specifically with RNA-Seq count data [17]. In an RNA-Seq dataset, the expression levels of a specific gene were modeled using the Poisson distribution. This Poisson model is verified in the case where there are only technical replicates using a single source of RNA [15].
In the Poisson model, over-dispersion occurs if the sample variance is greater than the sample mean. There could be several sources that cause over-dispersion in RNA-Seq data, including the variability in biological replicates due to heterogeneity within a population of cells, possible correlation between gene expressions due to regulation, and other uncontrolled variations [18]. The existence of over-dispersion in real data was observed in several previous studies [18]. Popular models to safeguard against over-dispersion include the negative binomial distribution, or two-stage Poisson distribution [19], as discussed below.
When over-dispersion is observed across the samples, the gene counts cannot be estimated accurately by a simple Poisson model [20]. for the yeast data. We analyzed the read count of the Mice-Brain tissue data under four experimental conditions: Z 1 = Chrom_ chr11, Z 2 = Chrom chr9_ra, Z 3 = Chrom chrUn_ra, and Z 4 = Chrom chr13_ra, and d = the gene dispersion levels. Z j are modeled as categorical variables with categorical with Z 4 being the reference category, and d is measured on the continuous scale. Figure 6 shows the histogram of the read counts for the 4 groups (Tables 2(a)-(d)).
We now analyze the data using three count regression models; the Poisson, the Negative binomial, and the QNB (Tables 3-5).  3) Quasi negative binomial regression model.

Comments on the Data Fitting
We used three count regression models to fit the RNA-SEQ data. All models

Discussion
There has been a growing interest among bioinformaticians and statisticians in constructing flexible distributions for counts that exhibit overdispersion to improve the modeling of count data. As a result, significant progress has been made towards generalizing some well-known discrete models, which have been successfully applied to problems arising in several areas of research. The proposed distribution was utilized to model two data sets; it was shown to provide a better fit than several other related models, including some with the same number of parameters. In the future paper, we shall demonstrate the applicability of the limiting form of our proposed distribution to genomics data together with inference procedures using multiple samples. Finally, we believe that the infe- b1=par [2] b2=par [3] b3=par [4] b4=par [5] beta=par [6] n=length(y) eta=b0+b1*x1+b2*x2+b3*x3+b4*x4 mu=exp(eta)/(1+exp(eta)) ll=sum(log(beta-1)-log(beta-1+beta*y) +lgamma(beta+beta*y)-lgamma(1+y)-lgamma(beta+beta*y-y) +y*log(mu)+(beta+beta*y-1-y)*log(1-mu)) return(-ll)