Using Confidence Statements to Ordering Medians: A Simple Microarray Nonparametric Analysis

Comparing two samples with respect to corresponding parameters of their populations is a classical statistical problem. In this paper, we present a simple yet effective tool to compare two samples through their medians. We calculate the confidence of the statement “the median of the first population is strictly smaller (larger) than the median of the second.” We analyze two real data sets and empirically demonstrate the quality of the confidence for such a statement. This confidence in the order of the medians should be seen as a pre-analysis tool that can provide useful insights for comparing two or more populations. The method is based entirely on exact distributions, with no need for asymptotic considerations. We also provide Quor, an R package that implements the ideas discussed in this work.


Introduction
This paper proposes an analysis that can be used as an aid for subsequent, more complex statistical analyses, such as classification, clustering, and logistic regression. For more details see [1]. We discuss ideas to compare two independent groups and to evaluate a measure that indicates which group has smaller (larger) values than the other. These ideas are simple and effective and do not require sophisticated techniques. This work was motivated by the following example in oncology: preoperative Gleason scores, in general, provide valuable prognoses for cases with prostate cancer. However, this is not verified for patients with a high score of Gleason-7. This group of patients is characterized by tumours displaying considerable morphological heterogeneity among affected regions. Microarray data have been collected to search for a gene set that could distinguish between recurrent (R) and non-recurrent (NR) Gleason-7 prostate cancer patients. One possibly important gene associated with this disease is the RPS28 gene. In the study, there are two samples: the first has $m = 5$ R patients, and the second has $n = 20$ NR patients. Table 1 lists the microarray expression data for the 25 patients, and an illustration is given in Figure 1. As in many medical experiments, there are only a few cases in this study, and most of them are non-recurrent.
Suppose that the expression of a specific important gene is observed for each patient of the two independent samples. Let the ordered samples (observations) of the recurrent and non-recurrent cases be, respectively, $x_{(1)} \le \dots \le x_{(m)}$ and $y_{(1)} \le \dots \le y_{(n)}$, where $m$ and $n$ are the sample sizes. The objective is to find genes that are under (or over) expressed, which is sometimes expressed by the statement that an expected microarray observation of an R case, $x$, is smaller (larger) than the expected observation of an NR case, $y$. In other words, it is conjectured that, for $x$ and $y$ being observations of random variables $X$ and $Y$, one could expect, for under (over) expressed situations, that the probability of $\{X < Y\}$ is larger (smaller) than a specified value, for example 0.8 (0.2). One of the statistical hypotheses that could indicate the validity of the conjecture is $M_X < M_Y$ (with $M$ used to indicate the median and the subscripts used to separate R and NR cases). Note that uppercase letters are used for random variables and parameters and lowercase letters for observations: probabilities refer to $X$ and $Y$, and confidence refers to $x$ and $y$. We evaluate the confidence of the statement $M_X < M_Y$ (and of $M_X > M_Y$ as well). We name this measure the confidence statement. The proposed confidence statement was developed following the ideas of the non-parametric confidence interval for a population's median based on the binomial distribution.
The article is organized as follows: in Section 2, we give a brief review of the confidence interval for the population's median, and then we introduce the confidence statement; in Section 3, we analyze two real data examples, discussing the applicability of the procedure; and in Section 4, we provide conclusions and final remarks.

Confidence Intervals for Medians
In this section, we present the non-parametric confidence interval for a population's median based on the binomial distribution. For additional details we refer to [2] and [3, Chap. 7].
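An event related to a random variable $X$ is denoted by $A$, and $M_X$ is the median of $X$. $\Pr(A \mid M_X)$ indicates the probability of the event $A$ when $M_X$ is known. In general, the median $M_X$ of a random variable $X$ is a population parameter that satisfies the following inequalities:

$$\Pr(X \le M_X \mid M_X) \ge \tfrac{1}{2} \quad \text{and} \quad \Pr(X \ge M_X \mid M_X) \ge \tfrac{1}{2}.$$

In the continuous case, these inequalities are tight:

$$\Pr(X < M_X \mid M_X) = \Pr(X > M_X \mid M_X) = \tfrac{1}{2}.$$

Each observation therefore falls below the median with probability 1/2, like the toss of a fair coin. Considering that $(X_1, \dots, X_m)$ is a vector of $m$ independent and identically distributed random variables, with minimum $X_{(1)}$ and maximum $X_{(m)}$, we have that

$$\Pr(X_{(1)} > M_X \mid M_X) = \Pr(X_{(m)} < M_X \mid M_X) = \left(\tfrac{1}{2}\right)^m.$$

Again, taking the complement, one obtains the probability of the event $\{X_{(1)} < M_X < X_{(m)}\}$:

$$\Pr(X_{(1)} < M_X < X_{(m)} \mid M_X) = 1 - 2\left(\tfrac{1}{2}\right)^m = 1 - \left(\tfrac{1}{2}\right)^{m-1}.$$

After observing the sample, we write that the statement $\{x_{(1)} < M_X < x_{(m)}\}$ has a confidence equal to $1 - (1/2)^{m-1}$. We call the attention of the reader to the subtle difference between probability and confidence, as presented in [4], which justifies the use of distinct terminology. To clarify, the expression above, written with the order statistics (minimum and maximum), is a probability that holds before the observations are obtained; once the sample is observed, the same number is reported as a confidence. For instance, considering $m = 8$, we obtain a confidence of $1 - (1/2)^7 \approx 0.992$ for the statement $\{x_{(1)} < M_X < x_{(8)}\}$.

The intervals $(x_{(i)}, x_{(j)})$ are those in which we are interested. For $i < j$ and by using the same arguments of the previous discussion, we have the following probabilities:

$$\Pr(X_{(i)} > M_X \mid M_X) = \sum_{k=0}^{i-1} \binom{m}{k} \left(\tfrac{1}{2}\right)^m, \qquad \Pr(X_{(j)} < M_X \mid M_X) = \sum_{k=j}^{m} \binom{m}{k} \left(\tfrac{1}{2}\right)^m.$$

To obtain the confidence of the interval $(x_{(i)}, x_{(j)})$, the same argument of tossing a fair coin is used. We then obtain the following:

$$\Pr(X_{(i)} < M_X < X_{(j)} \mid M_X) = \sum_{k=i}^{j-1} \binom{m}{k} \left(\tfrac{1}{2}\right)^m.$$

We are interested in the interval with 95% confidence. Our procedure is based on an exact discrete distribution, so it will not attain exactly the standard 95% level (or any other level) but a close one: the larger the sample size, the closer it will be. For our simulated data, the length of our 97.85% interval is 0.9698, smaller than 1.3480, which is the length of the standard interval based on Student's t distribution, with 95.45% confidence. Thus, we obtained a shorter interval with higher confidence.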
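The binomial argument of this section can be checked numerically. The following is a minimal sketch, assuming continuous data (no ties); the function name `median_interval_confidence` is hypothetical and is not taken from the Quor package. The number of observations below the median is treated as a Binomial(m, 1/2) count, and the confidence of the interval between the i-th and j-th ordered observations is the probability that this count lies between i and j - 1.

```python
from math import comb


def median_interval_confidence(m: int, i: int, j: int) -> float:
    """Confidence that the median lies in (x_(i), x_(j)) for a sample of size m.

    Assumes continuous data, so the number of observations below the
    median follows a Binomial(m, 1/2) distribution. The median lies
    between the i-th and j-th order statistics exactly when that count
    is at least i and at most j - 1.
    """
    if not 1 <= i < j <= m:
        raise ValueError("indices must satisfy 1 <= i < j <= m")
    return sum(comb(m, k) for k in range(i, j)) / 2**m


# Minimum-to-maximum interval with m = 8: equals 1 - (1/2)**7.
print(median_interval_confidence(8, 1, 8))

# Second-smallest to second-largest observation with m = 10.
print(median_interval_confidence(10, 2, 9))
```

With m = 10, i = 2 and j = 9 the function returns about 0.9785, which matches the 97.85% level quoted for the simulated data. That the simulated sample had ten observations is an inference from this number, not something the text states.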

Confidence Statement on the Order of Medians
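We return to the problem of two samples that are used to compare two sub-populations, named case and control. The goal is to analyze the statement that the population median $M_X$ of $X$ is smaller (larger) than the population median $M_Y$ of $Y$. Recall that we use the notation $x_{(1)} \le \dots \le x_{(m)}$ and $y_{(1)} \le \dots \le y_{(n)}$ for the ordered observations. We can write the following probabilities:

$$\Pr(M_X < X_{(i)} \mid M_X) = \sum_{k=0}^{i-1} \binom{m}{k} \left(\tfrac{1}{2}\right)^m \quad \text{and} \quad \Pr(Y_{(j)} < M_Y \mid M_Y) = \sum_{k=j}^{n} \binom{n}{k} \left(\tfrac{1}{2}\right)^n,$$

and then for the joint probability one obtains

$$\Pr(M_X < X_{(i)},\; Y_{(j)} < M_Y \mid M_X, M_Y) = \left[\sum_{k=0}^{i-1} \binom{m}{k} \left(\tfrac{1}{2}\right)^m\right] \left[\sum_{k=j}^{n} \binom{n}{k} \left(\tfrac{1}{2}\right)^n\right].$$

This is a consequence of the independence of the two samples. When the observations satisfy $x_{(i)} < y_{(j)}$, the two events together imply $M_X < M_Y$, so the product is reported as the confidence of that statement. For the prostate cancer data ($m = 5$, $n = 20$), we are 96.3% confident about the statement $\{M_X < M_Y\}$.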
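The product form of the confidence can be illustrated with a short computation. This is a minimal sketch under stated assumptions: continuous data, independent samples, and an observed ordering in which the i-th smallest x lies below the j-th smallest y. The function names are hypothetical and do not reproduce the interface of the Quor package.

```python
from math import comb


def prob_median_below(m: int, i: int) -> float:
    """Pr(M_X < X_(i)): fewer than i of the m observations fall below M_X."""
    return sum(comb(m, k) for k in range(0, i)) / 2**m


def prob_median_above(n: int, j: int) -> float:
    """Pr(Y_(j) < M_Y): at least j of the n observations fall below M_Y."""
    return sum(comb(n, k) for k in range(j, n + 1)) / 2**n


def order_confidence(m: int, n: int, i: int, j: int) -> float:
    """Confidence of M_X < M_Y when x_(i) < y_(j) is observed.

    The two samples are assumed independent, so the joint probability
    of {M_X < X_(i)} and {Y_(j) < M_Y} is the product of the two
    binomial tail probabilities.
    """
    if not (1 <= i <= m and 1 <= j <= n):
        raise ValueError("indices must satisfy 1 <= i <= m and 1 <= j <= n")
    return prob_median_below(m, i) * prob_median_above(n, j)


# Sample sizes of the prostate cancer example: m = 5 (R), n = 20 (NR).
print(round(order_confidence(5, 20, 5, 5), 3))
```

With m = 5, n = 20 and i = j = 5 the product is (1 - 2^{-5}) times Pr(Bin(20, 1/2) >= 5), about 0.963, in agreement with the 96.3% reported for the prostate cancer data. The choice i = j = 5, meaning that the largest R observation lies below the fifth smallest NR observation, is inferred from that reported value and is not confirmed by the surviving text.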

The Schizophrenia Data Set
The Schizophrenia data set is from the Altar A study of the

Discussion
In the prostate cancer example, note that the one-sided t-test yields a p-value of 7.24% (14.48% for the two-sided test). This test considers the null hypothesis $\mu_X = \mu_Y$ against the alternative $\mu_X < \mu_Y$ ($\mu_X \neq \mu_Y$ is used for the two-sided test); $\mu$ here denotes the mean, not the median. Such a test has only asymptotic properties if the distributions of $X$ and $Y$ are not normal. On the other hand, the present paper proposes a method that imposes no distributional restriction, is exact, and is valid for any sample size. It can serve as a pre-analysis tool for the discovery of differentially expressed transcripts, before conducting other more complex or specialized procedures.
The method can be extended to more than two groups. In order to do that, the confidence level to detect a strict order has to be studied in more detail. The larger the number of sample groups, the smaller the expected confidence. This is because a product of numbers belonging to the interval $(0, 1)$ is smaller than any of its factors. For instance, consider three random variables, $X$, $Y$ and $Z$. The following inequality is obvious: