Predictive analysis of microarray data

Microarray gene expression data are analyzed by means of a Bayesian nonparametric model, with emphasis on prediction of future observables, yielding a method for selection of differentially expressed genes and a classifier.


Introduction
DNA microarrays are devices used to determine the expression (activity level) of a set of genes contained in a tissue sample. Briefly, they consist of small arrays of thousands of probes, on whose surfaces many copies of single-stranded DNA sequences corresponding to specific genes, or pieces of genes, are deposited. Reverse transcription of messenger RNAs extracted from the tissue produces a solution of DNAs whose sequences are complementary to those found on the microarray probes. This solution is colored and put into contact with the microarray surface. Sequences present in the solution hybridize with their complementary pairs on the microarray probes. Subsequent illumination of the microarray surface provides an image in which the intensity of each probe spot is related to the corresponding amount of messenger RNAs present in the tissue. Digital processing of this image outputs for each probe a positive number which measures the relative expression of the corresponding gene (see [1] and references therein for a detailed description of microarray technology).
Data from a typical microarray experiment consist of positive numbers representing the expression levels of the genes associated with the microarray probes for a group of individuals. Because the convoluted nature of the numeric values describing the expression levels makes it difficult to commit to a specific family of probability distributions in their modeling, our proposal is to approach this problem by means of a Bayesian nonparametric analysis. The emphasis placed by De Finetti [2] on prediction guides us, in the sense that both products of our analysis, a subset of differentially expressed genes and the corresponding classifier, are derived from probabilities of events related to values of future observables, with (unobservable) parameters playing only a subsidiary role.

Microarray data model
Our microarray data consist of the expression levels of p gene probes for m case patients that have been diagnosed with a certain disease or show some physiological alteration, and n healthy control individuals. The expression level of the j-th microarray probe for the i-th case patient is denoted by X_i^j. Similarly, expression levels for controls are denoted by Y_i^j. The expression levels of the p gene probes for the i-th case patient are abbreviated by X_i = (X_i^1, . . . , X_i^p); for controls, we define Y_i = (Y_i^1, . . . , Y_i^p) similarly. The graph below depicts the microarray data model. Absence of an arrow connecting two random objects means that they are conditionally independent given their parents. In this graph, the orphan vertices are independent Dirichlet processes distributed as F^j ∼ DP(c_j, F_0^j) and G^j ∼ DP(d_j, G_0^j), for gene probes j = 1, . . . , p. Necessary Dirichlet process properties and notations are collected in the first appendix.

Predictive selection of differentially expressed genes
Suppose that we have microarray gene expression data for m case patients and n healthy control individuals. Following the notations introduced in the previous section and using the convention that upper case letters represent random variables and lower case letters their realizations, we denote these data by x_1, . . . , x_m and y_1, . . . , y_n. Our first goal is to use the information contained in these data to establish which genes are expected to be less or more active for a future case patient, making these differentially expressed genes a subset of disease markers.
For each gene probe, the relative expression of the corresponding gene can be determined by our posterior opinion that for a future case patient the expression level of this probe will be smaller than the expression level of the same probe for a future healthy control individual, given all the information contained in the data. This posterior opinion is quantified by the posterior predictive probabilities Pr(X_{m+1}^j ≤ Y_{n+1}^j | X_1, . . . , X_m, Y_1, . . . , Y_n), for gene probes j = 1, . . . , p. Using the nonparametric data model of the previous section, these posterior predictive probabilities can be computed using the results given in the first appendix.
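In the weak-prior limit (concentration parameters taken to zero, as in the example section below), each of these posterior predictive probabilities reduces, per probe, to the fraction of case/control pairs whose expression levels satisfy the inequality. A minimal Python sketch of this limiting computation (a sketch under that assumption, not the paper's Perl code; function and array-shape conventions are ours):

```python
import numpy as np

def posterior_predictive_probs(X, Y):
    """For each gene probe j, the weak-prior limit of
    Pr(X_{m+1}^j <= Y_{n+1}^j | data): the fraction of case/control
    pairs (i, i') with X_i^j <= Y_{i'}^j.
    X: (m, p) array of case expression levels; Y: (n, p) for controls."""
    # Broadcast to an (m, n, p) array of pairwise comparisons per probe,
    # then average over the m * n pairs.
    return (X[:, None, :] <= Y[None, :, :]).mean(axis=(0, 1))
```

The broadcasting trick avoids an explicit double loop over the m · n pairs of individuals.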
The values of these probabilities determine an ascending ranking of relative expression of the gene probes. We denote by X_i^{(j)} the expression level of the gene probe occupying the j-th position in this ranking for the i-th case patient; the gene probes at the end of the ranking are thus referred to by the notation X_i^{(p−j+1)}. Similar notations, Y_i^{(j)}, are used for healthy control individuals.
The criterion for the choice of the subset of differentially expressed genes is to select from this ranking the first k (down regulated) gene probes and the last k (up regulated) ones, for some integer k ≥ 1. In the last section we show how k can be selected by cross validation.
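The selection step itself is a simple ranking operation. Under the convention that a larger posterior predictive probability Pr(X^j ≤ Y^j | data) means lower relative expression in cases, so that such probes head the ascending ranking of relative expression, a Python sketch (names are ours, not from the paper):

```python
import numpy as np

def select_probes(probs, k):
    """Rank probes by ascending relative expression: probes with the
    largest Pr(X^j <= Y^j | data) come first (down regulated in cases),
    those with the smallest come last (up regulated in cases).
    Returns (down_regulated_indices, up_regulated_indices), k of each."""
    order = np.argsort(-np.asarray(probs))  # descending in probability
    return order[:k], order[-k:]
```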

Predictive classification
With the subset of 2k differentially expressed gene probes obtained in the previous section, we construct a classification rule that allows us to take microarray data for a new individual and classify him as unhealthy or healthy. Let us denote case patients and healthy controls together as (Z_1, . . . , Z_{m+n}) = (X_1, . . . , X_m, Y_1, . . . , Y_n).

Defining the statistic T(Z_i) = Σ_{j=p−k+1}^{p} Z_i^{(j)} − Σ_{j=1}^{k} Z_i^{(j)}, which is an increasing function of the expression levels of the up regulated gene probes and a decreasing function of those of the down regulated ones, we expect case patients to exhibit values of this statistic larger than the corresponding values for healthy controls. The classification rule is based on this one dimensional statistic and takes into account the different sample sizes of the two groups, cases and controls. Given a new individual for whom we have microarray data z_{m+n+1}, the rule is to classify him as healthy when T(z_{m+n+1}) is less than a critical value t*; otherwise, we classify him as unhealthy. The critical value t* is computed using the results given in the first appendix.
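Assuming T(z) sums the expression levels over the up regulated probes and subtracts the sum over the down regulated ones (a natural reading of the monotonicity just described, not necessarily the paper's exact definition), the rule can be sketched as:

```python
import numpy as np

def T(z, up_idx, down_idx):
    """One dimensional classification statistic: increasing in the
    up regulated probes, decreasing in the down regulated ones."""
    z = np.asarray(z)
    return z[list(up_idx)].sum() - z[list(down_idx)].sum()

def classify(z, up_idx, down_idx, t_star):
    """Classify a new individual as healthy when T(z) falls below
    the critical value t_star; otherwise as unhealthy."""
    return "healthy" if T(z, up_idx, down_idx) < t_star else "unhealthy"
```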

Example
The publicly accessible Gene Expression Omnibus (GEO) database [3] provides microarray data from a study [4] of peripheral circulating B cells for m = 39 smoking and n = 40 non-smoking healthy American white women.
We carry out the analysis of this dataset using the results of the previous sections, considering the case of weak prior information about the distribution of the expression levels of the gene probes: within the nonparametric model, we compute all the desired probabilities taking the limit to zero of the concentration parameters of the Dirichlet processes. Table 1 shows the identifiers of the k = 4 pairs of down and up regulated gene probes computed for this dataset. The table also presents a leave-one-out cross-validated study of the sensitivity and specificity of the predictive classifier using these k = 4 pairs of down and up regulated gene probes. For this dataset, k = 4 is the smallest number of pairs of up and down regulated gene probes that gives the best balance between the cross-validated fractions of false negatives and false positives. Computer code in the Perl language [5] is presented in the second appendix.
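The leave-one-out study can be organized as follows: each individual is held out in turn, the whole selection-and-classification pipeline is refit on the remaining data, and the held-out individual is classified. A Python sketch of the bookkeeping (the pipeline is passed in as a function, an assumption of this sketch; the paper's own implementation is the Perl code of the second appendix):

```python
import numpy as np

def loo_sensitivity_specificity(X, Y, k, fit_and_classify):
    """Leave-one-out cross validation of the classifier.
    fit_and_classify(X_train, Y_train, k, z) is assumed to rebuild the
    gene selection and the critical value on the training data and to
    return 'unhealthy' or 'healthy' for the held-out sample z.
    Returns (sensitivity, specificity)."""
    # Sensitivity: fraction of held-out case patients called unhealthy.
    tp = sum(fit_and_classify(np.delete(X, i, axis=0), Y, k, X[i]) == "unhealthy"
             for i in range(len(X)))
    # Specificity: fraction of held-out controls called healthy.
    tn = sum(fit_and_classify(X, np.delete(Y, i, axis=0), k, Y[i]) == "healthy"
             for i in range(len(Y)))
    return tp / len(X), tn / len(Y)
```

Refitting inside each fold, rather than reusing the gene probes selected on the full data, is what keeps the sensitivity and specificity estimates honest.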

Acknowledgments
This paper is dedicated to Paulo Cilas Marques in memoriam. We thank Professor Luiz Eugênio Barbosa de Oliveira for his critical reading of the manuscript. Work partially supported by CAPES.

Appendix 1: The Dirichlet process
We are concerned with the representation of our uncertainties about some observable properties assuming values in a sampling space X , with sigma-field A , by means of a probability measure defined over an underlying measurable space (Ω, F ). The probability of an event B ∈ F is denoted by Pr(B).
The map Q : A × Ω → [0, 1] is a random probability measure over (X , A ) if Q( · , ω) is a probability measure over this measurable space for every ω ∈ Ω, and Q(A) = Q(A, · ) is a random variable for each A ∈ A .
Ferguson [6] defined a random probability measure Q as follows. Let α be a finite nonzero measure over (X, A) and specify that for each A-measurable partition {A_1, . . . , A_k} of X the random vector (Q(A_1), . . . , Q(A_k)) has the usual Dirichlet distribution with parameters (α(A_1), . . . , α(A_k)). One such Q is denominated a Dirichlet process with base measure α.
Ferguson proved that this definition entails the following facts. First, Q is a properly defined random process in the sense of Kolmogorov's consistency theorem [7]. Second, the expectation of Q has the simple expression E[Q(A)] = α(A)/α(X), for each A ∈ A. Third, if measurable observables X_i : Ω → X are conditionally independent and identically distributed, given Q, with Pr(X_i ∈ A | Q) = Q(A) almost surely, for i = 1, . . . , m, then a posteriori Q is again a Dirichlet process with base measure β defined almost surely by β(A) = α(A) + Σ_{i=1}^{m} I_A(X_i), for each A ∈ A. If we add a new observable X_{m+1} to the just described conditional model, its posterior predictive probability is Pr(X_{m+1} ∈ A | X_1, . . . , X_m) = E[Pr(X_{m+1} ∈ A | Q, X_1, . . . , X_m) | X_1, . . . , X_m] = E[Q(A) | X_1, . . . , X_m] = β(A)/β(X), almost surely, for every A ∈ A, in which the second equality follows from the conditional independence of the observables.
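The resulting posterior predictive probability, (α(A) + Σ_{i=1}^{m} I_A(X_i))/(α(X) + m), is straightforward to evaluate; a minimal Python sketch (names are ours):

```python
def dp_posterior_predictive(in_A, data, alpha_A, alpha_total):
    """Posterior predictive probability of the event {X_{m+1} in A}
    under a Dirichlet process prior with base measure alpha:
        (alpha(A) + #{i : X_i in A}) / (alpha(whole space) + m).
    in_A is the indicator function of the set A."""
    count = sum(1 for x in data if in_A(x))
    return (alpha_A + count) / (alpha_total + len(data))
```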
For microarray data the sampling space can be taken as the real line with its Borel sigma-field. If Q is a Dirichlet process with base measure α, it is convenient to work with the random distribution function defined by F(t, ω) = Q((−∞, t], ω). We abbreviate F(t) = F(t, · ). Defining c = α(R) and F_0(t) = α((−∞, t])/α(R), we denote the distribution of the random distribution function by F ∼ DP(c, F_0). Since β(R) = c + m, the posterior expectation of F is almost surely F̂_{0,m}(t) = E[F(t) | X_1, . . . , X_m] = (c F_0(t) + m F̂_m(t))/(c + m), in which F̂_m(t) = (1/m) Σ_{i=1}^{m} I_{(−∞, t]}(X_i) is the empirical distribution function. This gives us an interpretation of the base measure of the Dirichlet process. The total measure c works as a concentration parameter: for fixed sample size m, if we make c ↓ 0, the posterior expectation reduces to the empirical distribution function. Also, this expression of F̂_{0,m} shows that prior information contained in F_0 is washed out when, for fixed c, we let m → ∞.
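This posterior expectation is a convex mixture of the prior guess F_0 and the empirical distribution function, with weights c/(c + m) and m/(c + m); in Python (a sketch, names ours):

```python
def posterior_cdf(t, data, c, F0):
    """E[F(t) | X_1, ..., X_m] for F ~ DP(c, F0): a convex mixture of
    the prior distribution function F0 and the empirical distribution
    function of the data, with weights c/(c+m) and m/(c+m)."""
    m = len(data)
    ecdf = sum(1 for x in data if x <= t) / m
    return (c * F0(t) + m * ecdf) / (c + m)
```

Setting c = 0 recovers the empirical distribution function, matching the weak-prior limit used in the example section.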
Finally, suppose that we have a second sample: let Y_1, . . . , Y_n, Y_{n+1} be conditionally independent and identically distributed, given G, each of them having conditional distribution G, with G ∼ DP(d, G_0) independent of F. The posterior expectation is almost surely Ĝ_{0,n}(t) = E[G(t) | Y_1, . . . , Y_n] = (d G_0(t) + n Ĝ_n(t))/(d + n), in which Ĝ_n is the empirical distribution function of the second sample. If U and V are independent random variables with distribution functions F_U and F_V, respectively, a simple computation shows that Pr(U ≤ V) = ∫ F_U(t) dF_V(t). Therefore, since X_{m+1} and Y_{n+1} are conditionally independent, given {X_i}_{i=1}^{m} and {Y_j}_{j=1}^{n}, almost surely Pr(X_{m+1} ≤ Y_{n+1} | X_1, . . . , X_m, Y_1, . . . , Y_n) = ∫ F̂_{0,m}(t) dĜ_{0,n}(t). If we let c, d ↓ 0, this conditional probability reduces to ∫ F̂_m(t) dĜ_n(t) = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} I_{(−∞, Y_j]}(X_i), the fraction of pairs for which X_i ≤ Y_j.
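Numerically, the integral ∫ F̂_{0,m}(t) dĜ_{0,n}(t) splits over the two components of the mixture Ĝ_{0,n}: an integral of F̂_{0,m} against G_0, which the sketch below approximates by averaging over Monte Carlo draws from G_0 (an assumption of the sketch; any quadrature against G_0 would do), and an average of F̂_{0,m} over the observed Y values. In Python (names ours):

```python
import numpy as np

def posterior_cdf_mix(t, data, c, F0):
    # E[F(t) | data] = (c F0(t) + m * empirical cdf(t)) / (c + m)
    m = len(data)
    return (c * F0(t) + m * np.mean(np.asarray(data) <= t)) / (c + m)

def prob_x_le_y(x, y, c, d, F0, g0_draws):
    """Pr(X_{m+1} <= Y_{n+1} | data) = int Fhat_{0,m}(t) dGhat_{0,n}(t),
    with Ghat_{0,n} = (d G0 + n Ghat_n) / (d + n). The G0 component is
    approximated by averaging over g0_draws, assumed to be independent
    draws from G0."""
    n = len(y)
    prior_part = np.mean([posterior_cdf_mix(t, x, c, F0) for t in g0_draws])
    data_part = np.mean([posterior_cdf_mix(t, x, c, F0) for t in y])
    return (d * prior_part + n * data_part) / (d + n)
```

With c = d = 0 this returns exactly the fraction of pairs with X_i ≤ Y_j, the weak-prior limit used in the example section.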