
Comparing two samples with respect to corresponding parameters of their populations is an old and classical statistical problem. In this paper, we present a simple yet effective tool to compare two samples through their medians. We calculate the confidence of the statement “the median of the first population is strictly smaller (larger) than the median of the second.” We analyze two real data sets and empirically demonstrate the quality of the confidence for such a statement. This confidence in the order of the medians is to be seen as a pre-analysis tool that can provide useful insights for comparing two or more populations. The method is based entirely on the exact distribution of the order statistics, with no need for asymptotic considerations. We also provide the Quor statistical software, an R package that implements the ideas discussed in this work.

This paper proposes an analysis that can be used as an aid for subsequent more complex statistical data analyses, like classification, clustering, logistic regression, etc. For more details see [

Suppose that the expression of a specific important gene is observed for each patient in two independent samples, the recurrent and the non-recurrent cases, and let the ordered observations of each sample be, respectively,

Group | Ordered observations | | | | |
---|---|---|---|---|---|
Recurrent | 14.8557 | 15.2209 | 15.3839 | 15.4106 | 15.4155 |
Non-recurrent | 14.9309 | 14.9535 | 15.1009 | 15.1622 | 15.4361 |
 | 15.4716 | 15.4932 | 15.5545 | 15.5584 | 15.5622 |
 | 15.5629 | 15.5741 | 15.5759 | 15.6101 | 15.6211 |
 | 15.6488 | 15.6638 | 15.6684 | 15.6966 | 15.6984 |

We propose a measure to evaluate the confidence of the statement

In this section, we present the non-parametric confidence interval for a population’s median based on the binomial distribution. For additional details we refer to [

An event related to a random variable X is represented by A, while

In the continuous case, these inequalities are tight:

Considering that

After observing the sample, we write that the statement

After observing the sample,

As an analogy, one can think of the above method as equivalent to tossing a coin m times, computing the probability of zero successes, which is

Letting i and j be indices in the set

To obtain the confidence of the interval

For

To illustrate the confidence interval, we generate a sample with

We are interested in the interval with 95% confidence. Because our procedure is based on an exact discrete distribution, it will generally not attain exactly the standard 95% level (or any other prescribed level) but a level close to it: the larger the sample size, the closer it will be. Our simulated data produce the intervals

The length of our 97.85% interval is 0.9698, smaller than 1.3480, which is the length of the standard interval based on Student's t distribution, with 95.45% confidence. Thus, we obtained a shorter interval with higher confidence.
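The binomial calculation behind such intervals can be sketched in a few lines of Python. This is a minimal illustration, not the Quor implementation: the function names `interval_confidence` and `median_interval` are hypothetical, the population is assumed continuous, and the rule "shortest interval reaching the requested level" is our reading of the procedure. The sample size behind the simulated example is not restated here, so n = 10 is used purely as an assumption; with n = 10, the interval between the 2nd and 9th order statistics has confidence 1002/1024 ≈ 0.9785, which agrees with the level quoted above.

```python
from math import comb


def interval_confidence(n: int, i: int, j: int) -> float:
    """Confidence that the median lies in [X_(i), X_(j)] for a sample of size n.

    The number of observations below the median is Binomial(n, 1/2) for a
    continuous population, and the interval covers the median exactly when
    that count is between i and j - 1.
    """
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    return sum(comb(n, k) for k in range(i, j)) / 2 ** n


def median_interval(sample, level=0.95):
    """Shortest order-statistic interval whose confidence is at least `level`.

    Returns (lower, upper, confidence). Assumed selection rule, for illustration.
    """
    xs = sorted(sample)
    n = len(xs)
    best = None
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            conf = interval_confidence(n, i, j)
            if conf >= level:
                width = xs[j - 1] - xs[i - 1]
                if best is None or width < best[0]:
                    best = (width, xs[i - 1], xs[j - 1], conf)
    if best is None:
        raise ValueError("sample too small for the requested level")
    return best[1], best[2], best[3]


# Assumed sample size n = 10: interval between the 2nd and 9th order statistics.
print(round(interval_confidence(10, 2, 9), 4))
```

Because the attainable levels form a discrete set, asking for 95% returns the closest level at or above it rather than 95% exactly.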

Returning to the problem of comparing two sub-populations through two samples, say case and control, the goal is to analyze the statement that the population median

Suppose that there are observations

and then for the joint probability one obtains

After observing that

We point out that we are looking for the shortest interval with high confidence. Consequently, to evaluate the confidence of the statement

The closer we get to 1, the more confident we are about

In the example shown in

This is a consequence of the fact that

In other words, we are 96.3% confident about the statement

The Schizophrenia data set is from the Altar A study of the Stanley Medical Research Institute’s online genomics database (SMRIDB) [

In the prostate cancer example, note that the one-sided t-test yields a p-value of 7.24% (14.48% for the two-sided test). This is used to test

Transcripts | Confidence | Status | Median Order^{*} |
---|---|---|---|
215003 | 0.99609 | Under | |
208581 | 0.99521 | Over | |
212854 | 0.99200 | Over | |
216336 | 0.98681 | Over | |
212294 | 0.98681 | Over | |
213626 | 0.98549 | Over | |
209847 | 0.98549 | Under | |
208399 | 0.98549 | Under | |
204326 | 0.98549 | Over | |
221011 | 0.98439 | Under | |

^{*}M_{S}: median for schizophrenic patients; M_{C}: median for control individuals. Under: for the specific transcript, the schizophrenic group is under-expressed in comparison to the control individuals. Over: for the specific transcript, the schizophrenic group is over-expressed in comparison to the control individuals.

The development of the present method builds on the studies from [

By using the equivalence of confidence statements and significance testing DeGroot 1975 [

In the schizophrenia example, we analyzed all 20993 genes to find those that were most differentially expressed. Among the 10 most differentially expressed transcripts, 4 were under-expressed and 6 were over-expressed. All of their confidence values were higher than 98%, which we consider good confidence levels.

This work intends to provide a method that can be employed as a first-step procedure whenever a data set is to be analyzed. The authors believe that this method can be used to eliminate those variables that have no power to help in the discovery of differentially expressed transcripts, before conducting other more complex/specialized procedures.

The method can be extended to more than two groups. To do so, the confidence level needed to detect a strict order has to be studied in more detail. The larger the number of sample groups, the smaller the expected confidence. This is because the product of numbers belonging to the interval

If the observed order of statistics follows the inequality

de Campos et al. [

The procedure to evaluate the confidence statement is available in the R package Quor at https://code.google.com/archive/p/quor/. The package is distributed as an open-source program under the GPLv3 license.
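For readers who want to check the two-sample calculation by hand, the following Python sketch illustrates our reading of it. It is not the Quor code and does not use Quor's interface; the function names are hypothetical. The assumed rule is: for every pair of order statistics with x_(j) ≤ y_(i), multiply the probability that the first median lies below x_(j) by the probability that the second median lies above y_(i), and keep the largest product. Applied to the prostate cancer data listed earlier, this rule reproduces the 96.3% confidence reported for that example.

```python
from math import comb


def prob_median_below(n: int, j: int) -> float:
    """P(M < X_(j)): fewer than j of the n observations fall below the median."""
    return sum(comb(n, k) for k in range(j)) / 2 ** n


def prob_median_above(n: int, i: int) -> float:
    """P(M > Y_(i)): at least i of the n observations fall below the median."""
    return sum(comb(n, k) for k in range(i, n + 1)) / 2 ** n


def confidence_median_less(x, y) -> float:
    """Confidence of 'median of x-population < median of y-population'.

    Assumed rule (illustrative): maximize the product of the two one-sided
    confidences over all pairs of order statistics with x_(j) <= y_(i).
    """
    xs, ys = sorted(x), sorted(y)
    best = 0.0
    for j, xj in enumerate(xs, start=1):
        for i, yi in enumerate(ys, start=1):
            if xj <= yi:
                product = prob_median_below(len(xs), j) * prob_median_above(len(ys), i)
                best = max(best, product)
    return best


# Gene expression values from the prostate cancer example in this paper.
recurrent = [14.8557, 15.2209, 15.3839, 15.4106, 15.4155]
non_recurrent = [
    14.9309, 14.9535, 15.1009, 15.1622, 15.4361,
    15.4716, 15.4932, 15.5545, 15.5584, 15.5622,
    15.5629, 15.5741, 15.5759, 15.6101, 15.6211,
    15.6488, 15.6638, 15.6684, 15.6966, 15.6984,
]

conf = confidence_median_less(recurrent, non_recurrent)
print(round(conf, 3))
```

Here the maximum is attained at the largest recurrent value (15.4155) paired with the fifth non-recurrent value (15.4361), giving (31/32) × (1 − 6196/2^20) ≈ 0.963.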

Carlos Alberto de Braganca Pereira is CNPq Fellow-Brazil (308776/2014-3).

The authors declare no conflicts of interest regarding the publication of this paper.

de B. Pereira, C.A. and Polpo, A. (2020) Using Confidence Statements to Ordering Medians: A Simple Microarray Nonparametric Analysis. Open Journal of Statistics, 10, 154-162. https://doi.org/10.4236/ojs.2020.101012