Bayesian Non-Parametric Mixture Model with Application to Modeling Biological Markers

The effect of a treatment on a patient's outcome can be assessed through the impact of the treatment on biological events. Observing patients under treatment for a period of time helps determine whether there is any change in the patient's biomarker. It is important to study how the biomarker changes due to treatment and whether individuals located in separate centers, who might have different distributions, can be clustered together. The study is motivated by the Bayesian non-parametric mixture model, which is more flexible than Bayesian parametric models and is capable of borrowing information across different centers, allowing them to be grouped together. To this end, this research modeled biological markers taking surrogate markers into consideration. The study employed the nested Dirichlet process prior, which can readily be placed on the different distributions of several centers, with centers from the same Dirichlet process component clustered together automatically. The study sampled from the posterior using a Markov chain Monte Carlo algorithm. The model is illustrated with a simulation study to see how it performs on simulated data. The simulation study showed that the model was capable of grouping the data into distinct clusters.


Introduction
Modeling hierarchical data when the distribution is unknown is a major problem that has affected many researchers dealing with big data [1]. The difficulty arises from disparity within the data. To account for this heterogeneity, a Bayesian non-parametric model is necessary, as it leads to flexible density estimates capable of identifying clusters of individuals with similar biomarker characteristics. A Bayesian non-parametric mixture model is a good fit for modeling biological markers because it remains flexible when the data have a skewed and multi-modal distribution. Moreover, data sets become bigger every day and require flexible models that can expand with the data. The mixture approach allows data points to be assigned to clusters probabilistically [2]. The model also supports out-of-sample cluster assignment by computing posterior probabilities for new data points.
In clinical trials, the purpose of a treatment is either to decrease the burden of the disease for the patient or to eliminate the disease. Identifying a biomarker that is changed by a treatment is not easy because of difficulties associated with the disease mechanisms. Even when a biomarker affected by the treatment has been identified, establishing the association between the biomarker and the outcome is not easy because of variability in the biomarker, in patient response, and in the evaluation methods used. Thus, it is important to identify the change each individual exhibits and whether or not a change occurs as a result of the treatment [3]. The responses of individuals to treatment may be related, so identifying groups of individuals who share similar characteristics is important.
Many authors have applied Bayesian non-parametric procedures to study various categories of biomarkers, including prognostic, predictive, pharmacodynamic, and surrogate endpoints. For example, [4] studied prognostic biomarkers and showed how they relate to the clinical outcome using Bayesian non-parametric procedures. Additionally, [3] studied prognostic biomarkers using Bayesian parametric procedures, and [5] studied surrogate endpoints using Bayesian methods. These studies identified the need to study biomarkers and determine how they are related to the clinical outcome.
Bayesian non-parametric methods have wide application in many areas, especially big data analytics. They are widely used for problems in which the dimension of interest grows with the size of the data, for instance, problems where the number of features varies as more data are observed. They are also commonly used in clustering, where the number of clusters depends on the data being used. In general, in Bayesian non-parametric models the number of parameters increases as the size of the data grows.
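The growth of model complexity with sample size can be illustrated with the Chinese restaurant process, the partition rule implied by a Dirichlet process. The sketch below is an illustration only and is not taken from the study: the function name `crp_num_clusters`, the concentration value 1.0, and the sample sizes are assumptions chosen for the example. It counts how many clusters are occupied after n observations.

```python
import numpy as np

def crp_num_clusters(n, alpha, rng):
    """Assign n observations by the Chinese restaurant process
    and return the number of occupied clusters."""
    counts = []  # counts[k] = number of observations already in cluster k
    for i in range(n):
        # Existing cluster k is chosen with probability counts[k] / (i + alpha);
        # a new cluster is opened with probability alpha / (i + alpha).
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
    return len(counts)

rng = np.random.default_rng(0)
alpha = 1.0                # assumed concentration parameter
sizes = [10, 100, 1000]    # assumed sample sizes
avg_clusters = [
    np.mean([crp_num_clusters(n, alpha, rng) for _ in range(50)])
    for n in sizes
]
print(dict(zip(sizes, avg_clusters)))
```

Under these assumptions the average number of clusters rises slowly (roughly logarithmically) with n, which is the sense in which the number of parameters is not fixed in advance.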
The study of [4] applied Bayesian non-parametrics to modeling biological markers. The model assumed that measurements of the biomarkers were taken continuously before the subjects under study were introduced to treatment and after they had been given some treatment. In that study the measurements did not depend on covariates and the survival result was due to mea- Accordingly, [1] developed an integrative Bayesian predictive modeling framework to identify individual pathological brain states depending on the choice of fluoro-deoxyglucose positron emission tomography (PET) imaging biomarkers and evaluated the relation of the states to a clinical outcome. The study identified patient subgroups characterized by different biomarkers that produce the clinical outcome. The strategy also associated imaging biomarkers with the pathological states of the individuals and assumed that the latent individual state takes its value from one of the pathological states, one of which served as a reference point. The latent random variables were independent and identically distributed, following a multinomial distribution. A Dirichlet prior was placed on the mixture weights. Where the Gaussian distribution was considered, the mean was taken as one of the parameters, modeling the latent state-specific random effect and characterizing the mean metabolic profile for individuals within the latent state. The variance-covariance matrix captured the association between regions for individuals within a latent state. A likelihood function was also established.
Additionally, [6] developed a Bayesian model for sample inference when inverse-probability weights are available. The study used a hierarchical method in which the distribution of the weights of the non-sampled units was modeled and predictors were included through a non-parametric Gaussian process. A simulation study was used to check how the procedure performed compared with the classical design-based estimator. The study concluded that the Bayesian non-parametric finite population estimator is more appropriate than the classical estimator.
Also, [7] compared the hierarchical Bayes model for biomarker subset effects in clinical trials with the profile likelihood method, making inferences about the threshold parameter using the bootstrap. The method provided improved sample properties for coverage probability at the 95% confidence level. Therefore, the importance of modeling surrogate markers in this study is to determine the relationship between the baseline biomarker and the samples taken after an individual has been given some treatment. Bayesian non-parametric methods are flexible and can indicate this relationship, showing whether there is any change and identifying groups of individuals with similar characteristics through clustering [8]. The method is also capable of showing whether, after treatment, the distribution of the biomarker increased, decreased, or did not change at all.
The rest of the paper is arranged as follows. Section 2 discusses the general modeling framework. Section 3 discusses the proposed model by detailing the nested Dirichlet process model for characterizing patient profiles. In Section 4, the hierarchical model is formulated. Section 5 describes the posterior computation. Section 6 presents a simulation study to assess the performance of the model. Finally, the conclusion is in Section 7.

General Modeling Framework
Let T denote the treatment effect, X the baseline biomarker, Y the post-treatment values, E the clinical outcome, and Z the covariates that are present. Let $p(\cdot)$ denote a distribution; for instance, $p(E \mid X, Y, Z, T)$ is a conditional distribution. If the treatment impact T is taken into consideration, the biomarker distribution will be affected. To address this, the intra-patient change from X to Y must be considered, and assessing it requires taking into account the relationship between X and Y that arises from intra-patient effects. The effect of the treatment on Y and the effect of the covariates on X or Y lead to $p(Y \mid X, Z, T)$ and $p(X \mid Z)$, though these distributions can be highly dispersed and complex. The model in this study represents a biomarker profile as $\Delta$, which symbolizes the change in the biomarker due to treatment, and incorporates it into the model to include the impact of the change on the outcome E.
The model is also able to classify groups of individuals with various changes in biomarker profiles according to the impact that T and the change $\Delta$ have on E. A probabilistic factorization is therefore employed, referred to as Equation (1). From Equation (1), the following assumptions are made:
1) Given the effect of the covariates and the treatment, the impact of (X, Y) on E is captured by the change $\Delta$.
2) Although the distributions of X and Y may depend on the covariates, the study assumes that neither depends on the covariates.
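Equation (1) did not survive in this copy of the text. The block below is one factorization that is consistent with the distributions named in this section, offered as a hedged reading rather than the authors' own equation. It applies the chain rule, treats the baseline X as unaffected by T (an assumption suggested by the term $p(X \mid Z)$), and then applies the two assumptions in turn.

```latex
% Hedged reconstruction; not the authors' typeset Equation (1).
\begin{align*}
p(E, X, Y \mid Z, T)
  &= p(E \mid X, Y, Z, T)\; p(Y \mid X, Z, T)\; p(X \mid Z)
  && \text{(chain rule; baseline $X$ unaffected by $T$)} \\
  &= p(E \mid \Delta, Z, T)\; p(Y \mid X, Z, T)\; p(X \mid Z)
  && \text{(assumption 1)} \\
  &= p(E \mid \Delta, Z, T)\; p(Y \mid X, T)\; p(X)
  && \text{(assumption 2)} \\
  &= p(E \mid \Delta, Z, T)\; p(X, Y \mid T).
\end{align*}
```

Read this way, the model splits into a part for the outcome given the change and a part for the joint biomarker distribution $p(X, Y \mid T)$.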
A hierarchical Bayesian non-parametric model is employed for $p(X, Y \mid T)$, and a Bayesian non-parametric regression model is employed for the remaining component of the factorization, giving adaptable cluster estimates for the individual-specific distributions of $\Delta$ and their clusters. The hierarchical structure is obtained by assuming that the individual-specific Dirichlet processes are conditionally independent samples from a hyperprior that is itself a Dirichlet process.

Proposed Model
Here the structure of the data is developed and the general model introduced. The subjects are indexed by $i$. For the $i$th individual, let $n_i$ and $m_i$ be the numbers of measurements of the biomarker level obtained before and after treatment, respectively. Let $X_i$ and $Y_i$ be the individual's pre- and post-treatment biomarker values, and let $\Delta_i$ represent the individual's change in biomarker level between the two. Here $T_i$ is the treatment given to the $i$th individual and $\Delta$ is some measure of distributional distance.
The distributional distance is defined on the space of cumulative distribution functions (cdfs) of one-dimensional random variables, as the distance between the two cdfs $F_X$ and $F_Y$. The vertical quantile function is given in Equation (2), where $F_X$ and $F_Y$ are the cdfs of the diagnostic variables in the populations. Here, the interest is not in assessing the diagnostic performance of a biomarker but in evaluating the targeted treatment, so the vertical quantile function is estimated from the distribution functions $F_{X_i}$ and $F_{Y_i}$ of the subject-level biomarker values of different individuals. The resulting distributional change is given in Equation (3), where $\beta$ is a vector of regression parameters and $\theta$ parameterizes the hierarchical model.
Each mixture component has a weight $w_i$, subject to the constraint $\sum_i w_i = 1$, so that the total probability normalizes to 1. The Gaussian mixture model is built from these weights, assuming a DP with concentration parameter $\alpha$ and base distribution $G_0$.
The aim is to assess the change in the distribution of $Y_i$ versus $X_i$ in terms of $\Delta_i$, so as to investigate the association of the change with the outcome and to classify groups of subjects that have the same biological responses. Additionally, a prior model is defined on $G_{X_i}$ and $G_{Y_i}$. It involves the Dirichlet process (DP), a commonly preferred prior probability model because of its clustering capability. [9] expressed this as $G \sim \mathrm{DP}(\alpha, G_0)$, a random distribution $G$ following a DP with base distribution $G_0$. Here $G_{0k}$ is a realization from a common DP prior. Therefore, each $G_{X_i}$ and $G_{Y_i}$ is automatically drawn from a collection of distinct distributions, the $G_{0k}$'s.
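The displayed mixture density is missing from this copy of the text. The block below writes out the standard finite Gaussian mixture that matches the description, as a sketch rather than the authors' exact equation. The component index $l$, the number of components $L$, and the parameters $\mu_l$ and $\sigma_l^2$ are assumed symbols; $l$ is used instead of $i$ only to avoid a clash with the subject index.

```latex
% Standard Gaussian mixture; L, mu_l and sigma_l^2 are assumed notation.
\begin{align*}
f(x) &= \sum_{l=1}^{L} w_l \, \mathcal{N}\!\left(x \mid \mu_l, \sigma_l^2\right),
\qquad w_l \ge 0, \quad \sum_{l=1}^{L} w_l = 1, \\
\int f(x)\, dx &= \sum_{l=1}^{L} w_l \int \mathcal{N}\!\left(x \mid \mu_l, \sigma_l^2\right) dx
= \sum_{l=1}^{L} w_l = 1 .
\end{align*}
```

The second line shows why the sum-to-one constraint on the weights is what makes the mixture a proper density. In a DP mixture the component parameters are drawn from the base distribution $G_0$ and $L$ is allowed to grow without bound.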

Formulation of the Hierarchical Model
The hierarchical model is formulated using the nested Dirichlet process (nDP) as follows. In the earlier discussion, it was established that $\Delta_i$ is a functional of ( ) , | ,
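The displayed hierarchy is not recoverable from this copy of the text. For orientation, the block below gives the usual stick-breaking form of a nested Dirichlet process, which may differ in detail from the authors' specification. The inner concentration parameter $\gamma$, the inner base measure $H$, and the atoms $\theta_{lk}$ are assumed symbols; $G_{X_i}$, $G_{Y_i}$, $G_{0k}$, and $\alpha$ follow the notation of Section 3.

```latex
% Standard nDP in stick-breaking form; gamma, H and theta_{lk} are assumed symbols.
\begin{align*}
G_{X_i},\, G_{Y_i} \mid Q &\sim Q, \qquad
Q = \sum_{k=1}^{\infty} \pi_k \, \delta_{G_{0k}}, \\
G_{0k} &= \sum_{l=1}^{\infty} w_{lk} \, \delta_{\theta_{lk}}, \qquad
\theta_{lk} \sim H, \\
\pi_k &= u_k \prod_{s<k} (1 - u_s), \qquad u_k \sim \mathrm{Beta}(1, \alpha), \\
w_{lk} &= v_{lk} \prod_{s<l} (1 - v_{sk}), \qquad v_{lk} \sim \mathrm{Beta}(1, \gamma).
\end{align*}
```

Because $Q$ is discrete, two subject-level distributions can coincide exactly with the same $G_{0k}$, which is what produces clustering of whole distributions rather than of single observations.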

Posterior Computation
The joint posterior distribution of the model parameters is computed numerically, using a Markov chain Monte Carlo (MCMC) algorithm for posterior inference. The full conditionals used to update the nDP are obtained using the method described by [10], which relies on a truncated Dirichlet process. At each iteration, the parameters of the baseline distribution $G_0^*$ are updated based on all the samples represented by the biomarker values. The algorithm truncates the stick-breaking representation of the Dirichlet process, which reduces the computation to that of a finite mixture model.
This assumes that individuals are clustered into K groups and for every indi-   Table 1, and plotted in Figure 1.
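The truncation idea can be sketched in a few lines. The code below is a generic illustration of a truncated stick-breaking draw, not the sampler of [10]: the function names, the truncation level of 20, and the concentration value 1.0 are assumptions made for the example. Setting the last stick fraction to one forces the finite set of weights to sum to one.

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng):
    """Draw K mixture weights from a stick-breaking process truncated at level K."""
    v = rng.beta(1.0, alpha, size=K)   # stick fractions V_k ~ Beta(1, alpha)
    v[-1] = 1.0                        # truncation: last atom takes the leftover mass
    # Mass still unassigned before each break: 1, (1-V_1), (1-V_1)(1-V_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def expected_last_weight(alpha, K):
    """Expected mass absorbed by the final atom, E[prod_{k<K} (1 - V_k)]."""
    return (alpha / (1.0 + alpha)) ** (K - 1)

rng = np.random.default_rng(1)
weights = truncated_stick_breaking(alpha=1.0, K=20, rng=rng)  # assumed settings
print(weights.sum(), expected_last_weight(1.0, 20))
```

The expected mass absorbed by the last atom shrinks geometrically in the truncation level, which gives a rough guide for choosing a level at which the finite approximation loses little.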
The true distributions are plotted in Figure 1.
Distributions S1 and S2 are asymmetric, each a mixture of two Gaussian components with different weights. Distributions S3 and S4 share three  V δ = .
The algorithm described in Section 5 is used to obtain samples from the posterior distribution under the nested Dirichlet process. The study runs an MCMC chain of 12,000 iterations, discarding the first 2000 iterations as burn-in and thinning to keep one in every 10 iterations, which leaves 1000 posterior draws.
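The burn-in and thinning schedule (12,000 iterations, the first 2000 discarded, one in every 10 kept) can be written as an index selection. In the sketch below the `chain` array is a random stand-in for real sampler output, and keeping the first post-burn-in iteration of each block of 10 is an assumption, since the text does not say which one is saved.

```python
import numpy as np

n_iter, burn_in, thin = 12_000, 2_000, 10

# Indices of the iterations retained after burn-in and thinning.
kept = np.arange(burn_in, n_iter, thin)

# Stand-in chain: random numbers in place of actual MCMC output (assumption).
rng = np.random.default_rng(0)
chain = rng.normal(size=n_iter)

posterior_sample = chain[kept]
print(len(posterior_sample))  # (12000 - 2000) / 10 retained draws
```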
The estimated distributions $E(F_k \mid y)$ for each distributional cluster are shown in Figure 3. Figure 3 closely mirrors Figure 1, which indicates that the posterior estimates obtained from the MCMC draws recover the true distributions from which the observations were generated. The posterior draws cover all the distributions and all of their components. Thus, in this case the Bayesian non-parametric mixture model reflects the individual biomarker distributions before treatment taking the same form as the after-treatment measurements drawn from different centers.
Also, the posterior cluster memberships take the same form as the true cluster memberships, as shown in Figure 4.
The posterior co-clustering probabilities take the same form as the true cluster membership. The model is able to assign groups of individuals from different centers (distributions) to one group. The individuals are placed into groups according to the prior information that is available. Hence, the diagram displays four clusters, matching the estimated distributions, as shown in Figure 4.
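Posterior co-clustering probabilities are usually estimated as the fraction of retained MCMC draws in which two units share a cluster label. The sketch below shows that calculation on a tiny made-up label matrix; the function name `coclustering_matrix` and the example labels are hypothetical and are not output from the study.

```python
import numpy as np

def coclustering_matrix(labels):
    """labels: (n_draws, n_units) integer array of cluster labels per MCMC draw.
    Returns the (n_units, n_units) matrix of co-clustering frequencies."""
    labels = np.asarray(labels)
    # same[d, i, j] is True when units i and j share a label in draw d.
    same = labels[:, :, None] == labels[:, None, :]
    return same.mean(axis=0)

# Hypothetical labels: 3 draws, 4 units. Label values may switch between draws.
labels = np.array([
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])
P = coclustering_matrix(labels)
print(P)
```

Since only equality of labels within a draw is used, the estimate is unaffected by label switching across iterations, which makes it a convenient summary to compare against true cluster memberships.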

Conclusions
We introduced a model based on the truncated nested Dirichlet process to identify groups of individuals who respond similarly to the same treatment for a specified biological marker. An MCMC algorithm was used for posterior inference. Since the nDP is a non-parametric model, it can group the observations from the mixture according to the entire distribution, rather than particular features of the distribution. In the simulation study, the proposed method for biological markers performed well in differentiating unimodal from multimodal distributions.
The proposed procedure in this paper reveals that Bayesian non-parametric This work can be extended to model the relationship between two or more groups of data after the individuals have been clustered. Also, the procedure did not take into account covariates that might affect the biomarkers.
These can also be incorporated to see whether they have any effect.