Confirmatory Methods, or Huge Samples, Are Required to Obtain Power for the Evaluation of Theories

Experimental studies are usually designed with specific expectations about the results in mind. However, most researchers apply some form of omnibus test to test for any differences, with follow-up tests like pairwise comparisons or simple effects analyses for further investigation of the effects. The power to find full support for the theory with such an exploratory approach, which is usually based on multiple testing, is, however, rather disappointing. With the simulations in this paper we showed that many of the common choices in hypothesis testing led to a severely underpowered form of theory evaluation. Furthermore, some less commonly used approaches were presented and a comparison of results in terms of power to find support for the theory was made. We concluded that confirmatory methods are required in the context of theory evaluation and that the scientific literature would benefit from a clearer distinction between confirmatory and exploratory findings. Also, we emphasize the importance of reporting all tests, significant or not, including the appropriate sample statistics like means and standard deviations. Another recommendation is related to the fact that researchers, when they discuss the conclusions of their own study, seem to underestimate the role of sampling variability. The execution of more replication studies in combination with proper reporting of all results provides insight into between-study variability and the amount of chance findings.


Introduction
Experimental studies are usually designed with specific expectations about the results in mind. Van den Hout and colleagues, for instance, designed a study to investigate differences in performance between interventions for posttraumatic stress disorder [1]. While it has been shown that Eye Movement Desensitization and Reprocessing (EMDR) is an effective treatment, recently, therapists sometimes replace eye movements (EMs) by alternating beep tones. To investigate if the intervention based on beep tones was 1) effective at all, and 2) equally effective as the intervention using EMs, patients were randomized over three groups: recall only, recall with EMs, or recall with beep tones. Three competing expectations for the outcome were formulated:
H1: beep tones are as effective as EMs.
H2: beep tones are not effective at all.
H3: beep tones are effective, but not as effective as EMs.
In terms of the three conditions, this can also be expressed as:
H1: EMs = beep tones > recall only.
H2: EMs > beep tones = recall only.
H3: EMs > beep tones > recall only.
The main goal of this experiment was to evaluate for which of these three competing hypotheses the data provided most support.
Another illustration of research with specific expectations about the results is presented by [2]. In a study on the effect of stereotype threats on the math performance of women and men, they hypothesized that on a relatively simple math test there would be no differences in performance between men and women, but that on a difficult test, on which they expected both men and women to perform worse than on the simple test, men would score better than women. Let $\mu$ denote the mean performance and the subscripts w = women, m = men, s = simple and d = difficult. The expectations can be expressed as:
$H_{\text{informative}}: \mu_{w,d} < \mu_{m,d} < \{\mu_{w,s} = \mu_{m,s}\}$.
This is an example of a factorial design, but the hypothesis of interest is not formulated as, nor approached by, (default) testing for main or interaction effects. Instead, it expresses the specific theory of the researcher in one hypothesis.
Both examples show that research expectations are often expressed using order constraints on the model parameters (e.g., means in experimental groups). Hypotheses in terms of such constraints are denoted ordered, inequality constrained, or informative hypotheses [3]. We prefer the last term for two reasons. First, the hypothesis of interest can include order/inequality constraints (<, >), but also equality constraints (=) and unconstrained parts (denoted using a comma, e.g., $\{\mu_1, \mu_2\} > \mu_3$ states that both $\mu_1$ and $\mu_2$ are greater than $\mu_3$, but there is no constraint with respect to the mutual relation of $\mu_1$ and $\mu_2$). Second, it emphasizes that the hypothesis is informative in the sense that it captures the information the researcher is interested in (i.e., the theory or explicit expectation).
A review of empirical literature shows that many research articles contain such hypotheses, that is, in the introduction of the paper the authors clearly state what their expectations with respect to (part of) the outcomes are. This is especially the case in experimental studies. Despite such prespecified expectations or theories, most researchers apply some form of omnibus test to test for any differences, with follow-up tests like pairwise comparisons or simple effects analyses for further investigation of the effects. The power to find full support for the theory with such an approach is, however, rather disappointing.
To illustrate this, consider the hypothesis expressing the expectation that four means are of increasing magnitude, that is, the hypothesis states what is called a simple ordering of four means: $H_{\text{informative}}: \mu_1 < \mu_2 < \mu_3 < \mu_4$. After assuring that all assumptions to perform an analysis of variance (ANOVA) are met, we believe that the majority of researchers would approach this hypothesis by first performing the omnibus F-test to see if there is evidence for any differences between the four means. After rejection of the null hypothesis "all means equal", one would probably investigate the pairwise comparisons to determine which means differ from each other. Throughout the paper, $\alpha = 0.05$ will be used to determine statistical significance. Full support for the theory could be claimed if 1) the omnibus F-test is statistically significant, 2) the sample means are in the hypothesized order, and 3) the pairwise comparisons testing $H_{01}: \mu_1 = \mu_2$, $H_{02}: \mu_2 = \mu_3$, and $H_{03}: \mu_3 = \mu_4$ are statistically significant. Note that other approaches to decide on full support for $H_{\text{informative}}$ are available and several will be discussed and investigated in the next section.
In a small simulation study, with population means in the hypothesized order, an effect size that can be labeled as medium (Cohen's $f = 0.27$) and a sample size of 50 per group, the power of the omnibus ANOVA was 0.92. Stated differently, in 92% of the data sets that were all simulated from the specified population, the hypothesis $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$ was rejected (p < 0.05). However, the percentage of data sets in which also the sample means were in the hypothesized order (i.e., $M_1 < M_2 < M_3 < M_4$) was only 65%. Worse, when counting the data sets in which also all three pairwise tests were statistically significant (p < 0.05), we ended up with a disappointing result of zero! The power to find full support for the hypothesized order, where full support is defined as finding statistically significant pairwise differences between the four subsequent means, is 0.00.
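A simulation of this kind can be sketched in a few lines of Python. The sketch below is not the code used for the paper. It assumes four equally spaced population means scaled to Cohen's f = 0.27, a within-group standard deviation of 1, and separate two-sample t-tests for the pairwise comparisons (rather than tests based on the pooled ANOVA error term). The function name `full_support_rates` and all parameter names are illustrative.

```python
import numpy as np
from scipy import stats


def full_support_rates(means, n_per_group, n_sims=2000, alpha=0.05, seed=1):
    """Estimate how often each nested 'support' criterion is met."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    omnibus = ordered = full = 0
    for _ in range(n_sims):
        # One data set: rows are participants, columns are groups (sd = 1).
        data = rng.normal(loc=means, scale=1.0, size=(n_per_group, means.size))
        groups = [data[:, j] for j in range(means.size)]
        if stats.f_oneway(*groups).pvalue >= alpha:
            continue
        omnibus += 1
        # Criterion 2: sample means strictly increasing.
        if not np.all(np.diff(data.mean(axis=0)) > 0):
            continue
        ordered += 1
        # Criterion 3: two-sided t-tests for subsequent means, no alpha correction.
        pvals = [stats.ttest_ind(groups[j], groups[j + 1]).pvalue
                 for j in range(means.size - 1)]
        if max(pvals) < alpha:
            full += 1
    return omnibus / n_sims, ordered / n_sims, full / n_sims


# Assumed population: four equally spaced means scaled so that Cohen's f = 0.27.
spacing = 0.27 / np.sqrt(1.25)
population_means = spacing * np.arange(4)
p_omnibus, p_ordered, p_full = full_support_rates(population_means, n_per_group=50)
print(p_omnibus, p_ordered, p_full)
```

The three returned proportions are nested by construction: a data set only counts towards full support if it also passed the omnibus test and the order check. Exact values depend on the assumed population means and on the seed.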
From these numbers it is clear that the power for an omnibus ANOVA and the power to find support for a specific expectation about a pattern of means can deviate substantially. Similar results and conclusions were previously reported in [4], although not in the context of testing informative (order constrained) hypotheses but in the general context of multiple testing. In [4] it was argued that multiple testing causes studies to be underpowered and that this leads to inconsistencies in the published literature. Multiple testing is also the main explanation for the low power in our illustration.
This paper has two main goals. We will show that many of the common choices in hypothesis testing lead to a severely underpowered form of theory evaluation. Furthermore, we will compare the results with available but less commonly used approaches and discuss when each of them could serve as a valuable and more powerful alternative.
In the next section six approaches are described that can be used in the context of a one-way ANOVA when the hypothesis of interest is a simple ordering of k means. For a variety of populations, two questions are investigated: "What is the power to find full support given that the power for the omnibus test is 80%?", and "What is the required sample size to obtain 80% full support power for the specific expectations?". In Section 3, the results are reported of simulation studies meant to investigate the specific interaction hypothesis in the two-way design of the math performance example just introduced. The paper is concluded with a discussion of results and possible implications for psychological research.

One-Way Analysis of Variance
In the context of a one-way ANOVA with k groups, six approaches are presented that researchers could employ when evaluating the explicit research hypothesis that the means are increasing, that is: $H_{\text{informative}}: \mu_1 < \mu_2 < \dots < \mu_k$. Although we do not claim that these are the only options available, we do believe that many researchers will recognize one or more of the presented approaches and probably have employed them in their own research. With simulation studies the performance of the six approaches will be systematically evaluated for a variety of populations. The question that is investigated is: If the hypothesized ordering of means is indeed present in the population, how often will each of the approaches find full support for this hypothesis? The technical aspects of approaches I-V (all using NHT) are provided in Appendix A. A short summary of approach VI (a Bayesian approach) is presented in Appendix B.
In Section 2.1, we present three approaches that are frequently seen in published research papers. However, these methods are not the best choice for theory evaluation, that is, for testing explicit hypotheses. Therefore, in Section 2.2, three alternative approaches are presented that may be better suited to evaluate pre-specified explicit hypotheses, but that are probably less familiar to some researchers. Sections 2.3, 2.4, and 2.5 present the results of several simulation studies.

Three Omnibus Test Based Approaches
The first three approaches are based on performing an omnibus ANOVA, despite the fact that the hypothesis of interest is more specific than the hypothesis evaluated with the omnibus test: $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ versus $H_A$: not $H_0$. Additionally, to evaluate the actual research hypothesis (i.e., to be able to claim full support), three different follow-up procedures are considered.

I. Omnibus ANOVA + sample means in hypothesized order
To claim support for the research hypothesis, the omnibus test must be statistically significant (p < 0.05) and the sample means ($M_j$) must be in the hypothesized order. To have convincing evidence for the specific hypothesis and to get the work published it seems, however, necessary to include follow-up testing.
II. Omnibus ANOVA + sample means in hypothesized order + all pairwise tests for subsequent means significantly different (without $\alpha$ correction)
In the second approach, pairwise tests for each pair of subsequent means are added ($H_{01}: \mu_1 = \mu_2$, $H_{02}: \mu_2 = \mu_3$, etc.). Since we are now applying multiple tests (each with $\alpha$ level 0.05) to get an answer to one specific research question, the issue of inflated type 1 errors emerges. Researchers have to decide how to control the familywise error and choose from a long list of available correction methods. This is not yet done in approach II, where no $\alpha$ correction is made. Note that this is equal to using the LSD (Least Significant Difference) method that SPSS offers (see, for instance, [5]).
III. Omnibus ANOVA + sample means in hypothesized order + all pairwise tests for subsequent means significantly different (with Bonferroni corrected $\alpha$)
In the third approach, the Bonferroni $\alpha$ correction is applied for the pairwise tests. The Bonferroni correction divides the desired overall $\alpha$ level by the total number of pairwise comparisons. Approaches II and III, therefore, provide results for two extremes: LSD is very liberal (no correction), Bonferroni is rather conservative (stringent correction). Note that default SPSS Bonferroni output is based on the total number of possible tests, that is, $k(k-1)/2$. However, since we investigate a simple ordering, we only need $(k-1)$ pairwise comparisons and will therefore use a less stringent correction (retaining more power).
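The arithmetic behind the two corrections can be made concrete with a small sketch. The helper names are illustrative, and the familywise error formula assumes independent tests, which pairwise tests on subsequent means are not, so that number is only a rough indication.

```python
def pairwise_counts(k):
    """Return (all possible pairwise comparisons, comparisons of subsequent means)."""
    return k * (k - 1) // 2, k - 1


def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha level after a Bonferroni correction."""
    return alpha / n_tests


def familywise_error(alpha, n_tests):
    """Chance of at least one false rejection if the tests were independent."""
    return 1 - (1 - alpha) ** n_tests


for k in (3, 4, 6):
    n_all, n_adjacent = pairwise_counts(k)
    print(k, n_all, bonferroni_alpha(0.05, n_all),
          n_adjacent, bonferroni_alpha(0.05, n_adjacent),
          round(familywise_error(0.05, n_adjacent), 3))
```

For k = 6, for example, correcting for the 5 comparisons of subsequent means gives a per-test level of 0.01, whereas correcting for all 15 possible comparisons would give roughly 0.0033.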

Three Alternative Approaches
To do justice to the confirmatory nature of research, for $H_{\text{informative}}$ an approach that tests the hypothesis more directly would be a better choice. Here, we present three approaches that can be used to evaluate an informative hypothesis that states a simple order of means.
IV. Multiple planned contrasts (one-sided)
Planned contrast testing is an alternative to omnibus testing and can be used whenever pre-specified hypotheses are available (e.g., [6]). In case of a simple order of k means, one option is to test $k - 1$ contrasts, where each contrast $C_i$ ($i = 1, \dots, k - 1$) represents the pairwise comparison of two subsequent means. The set of contrast weights for $k = 6$ means, for instance, is:
$C_1$: (−1, 1, 0, 0, 0, 0)
$C_2$: (0, −1, 1, 0, 0, 0)
$C_3$: (0, 0, −1, 1, 0, 0)
$C_4$: (0, 0, 0, −1, 1, 0)
$C_5$: (0, 0, 0, 0, −1, 1)
This provides, for example, $C_1 = M_2 - M_1$. Each contrast is tested with $H_0: C_i = 0$ versus $H_A: C_i > 0$ (i.e., with one-sided p-values). With planned contrast testing it is not necessary to first evaluate the omnibus ANOVA, but to have full support for the informative hypothesis each contrast must be statistically significant.
V. Linear contrast test (one-sided)
For hypotheses imposing a simple order on a sequence of means, the linear contrast is a close approximation. The linear contrast weights for $k = 3$, 4, and 6 means (the values that we will use in the simulations) are:
$k = 3$: (−1, 0, 1)
$k = 4$: (−3, −1, 1, 3)
$k = 6$: (−5, −3, −1, 1, 3, 5)
This provides, for example, $C_{\text{lin},4} = -3M_1 - M_2 + M_3 + 3M_4$. Since the contrast weights that are assigned to the sample means are increasing from negative to positive values, the value for $C_{\text{lin},k}$ will be positive if the means are in the hypothesized order. The hypothesis test is $H_0: C_{\text{lin},k} = 0$ versus $H_A: C_{\text{lin},k} > 0$ (i.e., with one-sided p-values). An advantage of this approach, compared to the previous one, is that the p-value relevant for the hypothesis at hand is obtained with a single test. A disadvantage is that the hypothesis that is tested (is the linear increase significantly different from zero?) is not equal to the originally stated hypothesis (are all means ordered from smallest to largest?).
VI. Bayesian approach developed specifically for the evaluation of informative hypotheses
Another method that will be evaluated is a Bayesian procedure specifically designed for the evaluation of informative hypotheses (see, for instance, [3] and [7]). With this model selection approach the support in the data for any hypothesis of interest is quantified with so-called Bayesian probabilities. Bayesian probabilities are numbers between zero and one reflecting the relative support for each hypothesis in a predefined set. In the simulation studies where the main interest is in a specific order constrained hypothesis, the set of models that will be compared consists of the null hypothesis ($H_0$) stating that all means are equal and the informative hypothesis imposing the ordering ($H_{\text{informative}}$). To be able to compare the performance of the Bayesian model selection with the results based on p-values, in the simulation studies we will use the Bayesian probabilities to make dichotomous decisions, that is, either the informative hypothesis received the most support, or not. Note that making such dichotomous decisions does fit in the NHT framework (a result is statistically significant or not, usually judged with the 0.05 criterion) but not in the Bayesian framework, where it is up to the researcher to decide if he/she considers the resulting support for a certain hypothesis worthwhile ([3], page 51). A short summary of the Bayesian approach used in the simulations is provided in Appendix B. More extensive, non-technical introductions of Bayesian evaluation of informative hypotheses are provided in [8]-[10].
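The contrasts used in approaches IV and V can be written down programmatically. The sketch below uses the standard equally spaced integer weights for a linear trend and simple adjacent-difference contrasts. The function names and the sample means are illustrative and do not come from the paper. The example also shows the disadvantage of approach V noted above: the linear contrast can be positive even when one pair of subsequent means violates the hypothesized order.

```python
from functools import reduce
from math import gcd


def adjacent_contrasts(k):
    """Weights for the k - 1 contrasts comparing subsequent means (approach IV)."""
    rows = []
    for i in range(k - 1):
        row = [0] * k
        row[i], row[i + 1] = -1, 1
        rows.append(row)
    return rows


def linear_weights(k):
    """Equally spaced integer weights summing to zero (approach V)."""
    raw = [2 * j - (k + 1) for j in range(1, k + 1)]
    divisor = reduce(gcd, (abs(w) for w in raw))
    return [w // divisor for w in raw]


def contrast_value(weights, sample_means):
    """Weighted sum of the sample means."""
    return sum(w * m for w, m in zip(weights, sample_means))


sample_means = [1.0, 3.0, 2.0, 4.0]  # hypothetical; M2 > M3 violates the order
print([contrast_value(row, sample_means) for row in adjacent_contrasts(4)])
print(linear_weights(4), contrast_value(linear_weights(4), sample_means))
```

With these hypothetical means the second adjacent contrast is negative, so approach IV could not yield full support, while the linear contrast value is still clearly positive.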

Defining the Populations
We investigated hypotheses expressing a simple ordering of $k = 3$, 4 and 6 means. The population parameter values were defined in agreement with the informative hypothesis and varied to obtain different effect sizes. In the context of an ANOVA the common effect size (ES) measure is Cohen's $f$, which is the ratio of the between groups standard deviation ($\sigma_M$) and the within groups (residual) standard deviation ($\sigma_W$): $f = \sigma_M / \sigma_W$. Cohen proposed to label $f = 0.1$ as a small effect, $f = 0.25$ as a medium effect and $f = 0.4$ as a large effect. In Table 1, the subpopulation means for each combination of k and ES are provided assuming residual variation $\sigma_W = 1$.
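As an illustration of how population means map onto Cohen's f, the sketch below computes f for equally sized groups and constructs equally spaced means that reach a target f. Equal spacing is an assumption made here for illustration, and the helper `equally_spaced_means` is hypothetical. The actual means used in the simulations are those reported in Table 1.

```python
from math import sqrt


def cohens_f(means, sigma_within=1.0):
    """Cohen's f for equally sized groups: between-group sd over within-group sd."""
    grand_mean = sum(means) / len(means)
    sigma_between = sqrt(sum((m - grand_mean) ** 2 for m in means) / len(means))
    return sigma_between / sigma_within


def equally_spaced_means(k, target_f, sigma_within=1.0):
    """Hypothetical helper: k equally spaced means, starting at 0, with the requested f."""
    unit_f = cohens_f(list(range(k)))  # f for spacing 1 and sigma_within 1
    step = target_f * sigma_within / unit_f
    return [step * j for j in range(k)]


for k in (3, 4, 6):
    means = equally_spaced_means(k, target_f=0.25)
    print(k, [round(m, 3) for m in means], round(cohens_f(means), 3))
```

Because f depends only on the spread of the means relative to the residual standard deviation, the spacing between subsequent means has to shrink as k grows if f is held constant.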

Power and Sample Sizes
From each population presented in Table 1, 10,000 data sets were sampled and subsequently analyzed with approaches I-V. The results for the Bayesian approach are based on 1000 data sets due to its intensive computation time. The sample sizes of the data sets are based on a power analysis using the following assumptions: 1) nowadays, it is more or less standard practice to start a research project with a power analysis to determine the number of required participants to obtain power of 0.80; 2) we expect that most researchers perform their power analysis for the omnibus test (i.e., for the one-way ANOVA) and that they do not take possible follow-up analyses and/or alpha corrections for multiple testing into account in the power analysis. Therefore, for each population, the required sample size to have 0.80 power for the omnibus ANOVA was determined and used in the simulations (numbers are reported in Table 2).
Additionally, for each of the six approaches, the approximate sample sizes required to obtain 0.80 full support power, as defined by the approach at hand and for the informative hypothesis of interest, are determined.
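A power analysis for the omnibus one-way ANOVA, as assumed above, can be sketched with the noncentral F distribution. The sketch assumes equal group sizes and the usual noncentrality parameter $f^2 N$, with $N$ the total sample size. The function names are illustrative, and the resulting group sizes are not guaranteed to match Table 2 exactly.

```python
from scipy import stats


def omnibus_power(f, k, n_per_group, alpha=0.05):
    """Power of the one-way ANOVA F-test for Cohen's f and k equal groups."""
    df_between = k - 1
    df_within = k * (n_per_group - 1)
    noncentrality = f ** 2 * k * n_per_group
    critical_value = stats.f.ppf(1 - alpha, df_between, df_within)
    return stats.ncf.sf(critical_value, df_between, df_within, noncentrality)


def required_group_size(f, k, target_power=0.80, alpha=0.05, max_n=5000):
    """Smallest group size whose omnibus power reaches the target."""
    for n in range(2, max_n + 1):
        if omnibus_power(f, k, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within max_n")


for k in (3, 4, 6):
    print(k, [required_group_size(f, k) for f in (0.10, 0.25, 0.40)])
```

The same function could in principle be reused to check the omnibus power for other designs, for instance `omnibus_power(0.27, 4, 50)` for the illustration in the Introduction.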

Results
In Table 2, the results for the six approaches are presented. The sample sizes used are provided in the first column and are based on a power analysis to obtain 0.80 power for the omnibus ANOVA. Note that all reported sample sizes are group sizes ($N_j$ for $j = 1, \dots, k$). The last six columns present the power to find full support for the research hypothesis with each of these approaches and using the sample sizes from the first column. So, for the 10,000 (1000 for approach VI) data sets that were sampled from prespecified populations with ordered means and effect sizes as specified, the resulting numbers in the table represent the proportions of these samples in which full support, as defined by each of the methods, for the hypothesis was found.
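Because the reported power values are proportions over a finite number of simulated data sets, they carry some Monte Carlo error. The sketch below uses the usual binomial standard error of a proportion to indicate the precision that can be expected with 10,000 versus 1000 replications. The function name is illustrative, and the calculation is an addition for orientation, not part of the reported analyses.

```python
from math import sqrt


def monte_carlo_se(proportion, n_replications):
    """Standard error of a proportion estimated from independent replications."""
    return sqrt(proportion * (1 - proportion) / n_replications)


for n_replications in (10_000, 1_000):
    se = monte_carlo_se(0.80, n_replications)
    # Second number: approximate half-width of a 95% interval.
    print(n_replications, round(se, 4), round(1.96 * se, 4))
```

For a true power of 0.80, the standard error is about 0.004 with 10,000 replications and about 0.013 with 1000 replications, so small differences between approach VI and the other approaches should be read with that in mind.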
The results in column I show that, even if the only requirement is that the observed sample means should be in the hypothesized order, the power to find full support diminishes fast with an increasing number of means in the ordered hypothesis (approximately 0.70 for k = 3, 0.50 for k = 4, and 0.10 for k = 6). Adding the requirement that pairwise tests (one-sided or two-sided, and with or without alpha corrections) should be significant reduces the power to find full support for the true order to zero in most cases (only for k = 3 does the full support power range between 0.02 and 0.15; see columns II-IV). Stated differently, with 10,000 replications from the same population, the true effect in the population was, with these methods, hardly ever fully confirmed.
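The effect of the number of means on the order requirement alone (leaving out the omnibus test) can be illustrated with a short simulation that draws the sample means directly from their sampling distributions. The sketch assumes equally spaced population means with f = 0.25 and within-group standard deviation 1. The group sizes are hypothetical values of roughly the size needed for 0.80 omnibus power, so the output will not reproduce column I of Table 2.

```python
import random
from math import sqrt


def prob_means_in_order(k, f, n_per_group, n_sims=5000, seed=1):
    """Share of simulated studies whose k sample means are strictly increasing."""
    rng = random.Random(seed)
    # Assumed population: equally spaced means with within-group sd 1.
    positions = [j - (k - 1) / 2 for j in range(k)]
    step = f / sqrt(sum(p * p for p in positions) / k)
    means = [step * p for p in positions]
    se = 1 / sqrt(n_per_group)  # standard error of each sample mean
    hits = 0
    for _ in range(n_sims):
        draws = [rng.gauss(mu, se) for mu in means]
        hits += all(a < b for a, b in zip(draws, draws[1:]))
    return hits / n_sims


# Hypothetical group sizes, roughly what gives 0.80 omnibus power at f = 0.25.
for k, n in ((3, 53), (4, 45), (6, 36)):
    print(k, n, prob_means_in_order(k, 0.25, n))
```

With a fixed f, more groups mean smaller differences between subsequent means and more pairs that all have to fall in the right order, which is why the proportion drops quickly as k increases.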
The last two columns show that with the two confirmatory methods that do not rely on multiple tests, the power to find support for the ordered hypothesis is in almost all cases higher than the power of the omnibus test (ranging from 0.74 to 0.98).Further, it shows that for small effect sizes the linear contrast test has higher power than the Bayesian model selection approach, but that for medium and large effect sizes this is the other way around.
In Table 3, for the six approaches, the approximate required group sample sizes to obtain 0.80 full support power are provided. The numbers are obtained by running a sequence of simulations for each population (i.e., combination of k and ES) with increasing sample sizes. We did not evaluate group sample sizes larger than 1000 because we believe they are not realistic in experimental research, so further precision seems unnecessary. The notation ">1000" in Table 3 therefore means that with $N_j = 1000$ the full support power was still below 80%. The results show that huge samples are required to have reasonable full support power to detect a small ES with any of the approaches I-IV (ranging from 360 to >1000 per group). Approach V is, for small ES, most powerful ($N_j$ range: 125 to 260) and outperforms the Bayesian approach ($N_j$ range: 220 to 315). However, given that the smallest required $N_j$ is still more than 100 respondents per subgroup, the results most of all show how difficult it is to find reliable and replicable support for specific expectations given that effect sizes are small.
The required sample sizes to find full support for medium effect sizes vary greatly between the approaches as well as between different numbers of subgroups (see Table 3). Here, the gain of using approaches V or VI is clearly observed: the required $N_j$ ranges from 21 to 45, where the Bayesian approach slightly outperforms the linear contrast approach. Similar patterns were observed for large effect sizes, where $N_j$ ranged from 25 to 670 for approaches I-IV and from 4 to 18 for approaches V and VI.
Overall, the numbers in Table 2 and Table 3 clearly show that confirmatory methods that do not suffer from multiple testing issues (that is, approaches V and VI) are needed to have a good chance, with feasible sample sizes, to find full support for the true order of the means.

Additional Results for the Bayesian Approach
The Bayesian analysis for comparing $H_0$ and $H_{\text{informative}}$ provides a Bayes factor ($BF_{10}$), where $BF_{10} > 1$ means more support for $H_{\text{informative}}$ and $BF_{10} < 1$ means more support for $H_0$. In Table 2, the reported proportions (the "power" of the Bayesian approach) were based on counting how often, in 1000 replications, $BF_{10}$ was bigger than 1. The interpretation of BFs is, however, not intended to be dichotomous ("hypothesis is supported or not"). To elaborate on the amount of support for the informative hypotheses that was found in the simulations, one could use the rules of thumb as presented by [12]. They propose that $BF_{10}$ below 3 is still "not worth more than a bare mention", but that support can be claimed in the range 3 to 20 and that this support can be labeled as strong for $BF_{10} > 20$. For the simulations presented in Table 2, these elaborated results are provided in Table 4 for the medium effect size.
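The thresholds just described are easy to apply mechanically, as the following sketch shows. It also converts a Bayes factor into a posterior probability for the informative hypothesis in a set of two hypotheses. Equal prior probabilities are an assumption made here, and the function names, label strings, and the treatment of values exactly on a boundary are illustrative choices.

```python
def posterior_probability(bf10, prior_h1=0.5):
    """Posterior probability of the informative hypothesis in a two-hypothesis set."""
    prior_odds = prior_h1 / (1 - prior_h1)
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1 + posterior_odds)


def evidence_label(bf10):
    """Verbal label following the thresholds 1, 3 and 20 quoted in the text."""
    if bf10 < 1:
        return "favors H0"
    if bf10 < 3:
        return "bare mention"
    if bf10 <= 20:
        return "support"
    return "strong support"


for bf10 in (0.5, 2.0, 8.0, 40.0):
    print(bf10, round(posterior_probability(bf10), 3), evidence_label(bf10))
```

Under equal prior probabilities, the dichotomous criterion used for Table 2 (a Bayes factor above 1) is the same as requiring a posterior probability above 0.5 for the informative hypothesis.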
From Table 4, we can see that in 7% to 9% of the samples the null hypothesis is favored over the ordered hypothesis, leading to a wrong conclusion. Note that this information was also presented in Table 2, where the "power of the Bayesian approach" was defined as finding $BF_{10} > 1$. On the second line of Table 4, we see that in about 10% of the samples the evidence is weakly in favor of the ordered hypothesis. In the remaining samples the support for the ordered hypothesis is substantial (in 22% of the samples) or even strong (49% for k = 6 to 62% for k = 3).

Non-Linear True Effects
A hypothesis stating a simple ordering of means is not equal to a hypothesis stating a non-zero linear effect. It is interesting to see if the power of approach V also holds when the population means are ordered from small to large, but not linearly, and how this power compares to approach VI, which explicitly states the expected order. To investigate this, simulations were run with $N_j = 25$ and $N_j = 50$ for all cells and different nonlinearly increasing population means. In Table 5, the investigated means are provided. The residual variance is 1 ($\sigma_W = 1$), and this provides effect sizes $f$ as reported in the second column of the table. The results are based on 10,000 samples for approach V and 1000 samples for approach VI and are reported in the last two columns of Table 5.
The results show that the power of the Bayesian approach is higher for k = 3 and k = 4 and that the differences between approaches V and VI are largest for $N_j = 25$. For k = 6, approach V outperforms approach VI for $N_j = 50$. No clear pattern emerges for $N_j = 25$: in some cases the power of approach V is higher and in others it is the other way around.
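How the one-sided linear contrast test behaves for ordered but non-linear means can be explored with a simulation along the following lines. This is a sketch, not the code behind Table 5. The population means are hypothetical, the within-group standard deviation is set to 1, and the test uses the pooled error variance with equal group sizes. Approach VI is not implemented here.

```python
import numpy as np
from scipy import stats


def linear_contrast_power(means, n_per_group, n_sims=2000, alpha=0.05, seed=1):
    """Estimated power of the one-sided linear contrast test (within-group sd 1)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.size
    weights = 2.0 * np.arange(1, k + 1) - (k + 1)  # e.g. -3, -1, 1, 3 for k = 4
    df_error = k * (n_per_group - 1)
    rejections = 0
    for _ in range(n_sims):
        data = rng.normal(loc=means, scale=1.0, size=(n_per_group, k))
        contrast = weights @ data.mean(axis=0)
        mse = data.var(axis=0, ddof=1).mean()  # pooled variance, equal group sizes
        t_stat = contrast / np.sqrt(mse * np.sum(weights ** 2) / n_per_group)
        rejections += stats.t.sf(t_stat, df_error) < alpha
    return rejections / n_sims


# Hypothetical ordered but non-linear means (not the values from Table 5).
print(linear_contrast_power([0.0, 0.05, 0.10, 0.60], n_per_group=25))
# Linearly increasing means with the same range, for comparison.
print(linear_contrast_power([0.0, 0.20, 0.40, 0.60], n_per_group=25))
```

Comparing the two calls gives an impression of how much power the linear contrast loses when the increase is concentrated in one step rather than spread evenly over the means.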

An Illustration of a Two-Way Analysis of Variance
Often, ANOVA tests are done in the context of factorial designs, that is, with two or more factors and an interest in main and/or interaction effects. The example provided in the introduction will be used as an illustration. The researchers investigated stereotypes and gender differences in math performance in three subsequent studies [2].
In their first study, a group of highly selected respondents (see [2] for details), consisting of 28 men and 28 women, was randomized over easier and difficult math tasks. The goal of this study was to investigate if a specifically described expected interaction pattern was found. They formulated their expectation, for the studied population, as: "women underperform on difficult tests but perform just as well on easier test" ([2], page 9).
The hypothesized outcome, assuming generally lower performance on the difficult test compared to the simple test, is represented in Figure 1. Formulated as an informative hypothesis, the expectation is:
$H_{\text{informative}}: \mu_{w,d} < \mu_{m,d} < \{\mu_{w,s} = \mu_{m,s}\}$.
The tests executed and reported in [2], however, do not directly address this expectation. They report F-tests for two main effects and an interaction effect (all p < 0.05), as well as post hoc pairwise comparisons of means. Therefore, multiple tests were required to come to the conclusion that, indeed, their expectation was supported.
A simulation study was designed to investigate the power to find full support for this specific expectation assuming different effect sizes and using several different approaches for the evaluation of the hypothesis. In Section 3.1 the approaches are presented and in Section 3.2 the design and results of the simulation study are provided.
Figure 1. Hypothesized results for the study on stereotypes and gender differences in mathematics [2], with math performance on the y-axis.

Factorial Approach
Most researchers would analyze these data with a two-way ANOVA, testing for both main effects and the interaction effect. Different follow-up strategies could be considered, leading to three approaches described below. To limit the number of variations, only results for two-sided tests and without alpha corrections are reported.
A. To conclude support for the theory, both main effects as well as the interaction effect should be statistically significant and the sample means ($M$) should be in a specific order, that is: $M_{w,d} < M_{m,d} < \{M_{w,s}, M_{m,s}\}$.
B. To conclude support for the theory, in approach B the three omnibus tests should be significant and the sample means in the right order (as in A), but also the simple main effects should support the theory. This implies finding a significant result for the test of $H_0: \mu_{w,d} = \mu_{m,d}$ and a non-significant result for the test of $H_0: \mu_{w,s} = \mu_{m,s}$.
C. Following [2], as a follow-up to the requirements of approach A, we tested all pairwise comparisons of means. The results should be non-significant for $H_0: \mu_{w,s} = \mu_{m,s}$, while the other 5 pairwise comparisons must be statistically significant.

One Way Approach
Since the factorial approach is rather exploratory (testing for any main effect and any interaction and not for the specific, expected patterns), the omnibus tests could be skipped and instead planned comparisons on the four subgroup means could be executed and interpreted. Note that this implies ignoring the factorial structure in the design. Two approaches are included in the simulations:
D. The first approach is based on planned comparisons on specific contrasts. The expected ordering of the means is represented by a first contrast, $C_1$. Support for the expectation can be concluded if the test of $H_A: C_1 > 0$ (i.e., with a one-sided p-value) is statistically significant. However, $C_1$ does not include the expectation that the last two means are not different. So, in addition we formulate a second contrast, $C_2$, which compares $M_{w,s}$ and $M_{m,s}$, and to conclude support for the theory this contrast test should not be significant (two-sided p-value).
E. The Bayesian approach for informative hypotheses can evaluate the expected pattern directly. In the simulation we will evaluate how often the informative hypothesis $H_{\text{informative}}: \mu_{w,d} < \mu_{m,d} < \{\mu_{w,s} = \mu_{m,s}\}$ receives more support than $H_0$. The informative hypothesis could also be evaluated against other hypotheses, such as a hypothesis expressing an alternative, competing expectation. However, the current choices (evaluating against $H_0$ and drawing dichotomous conclusions) are made to be able to meaningfully compare the results with the NHT results in approaches A-D.

Simulation Study and Results
For the simulation study, several populations were specified with means in agreement with the informative hypothesis of [2]. The residual variance was always one, and the differences between the means were increasing from relatively small to larger differences. The population means in five simulations are presented in Table 6. Results are also found in the table and are based on drawing 10,000 (1000 for approach E) samples with a sample size of 50 per group.
The results show, once again, that the power to find full support for a specific expectation dramatically decreases when multiple tests are involved, as in approaches A, B and, most of all, C. The power of approach D is already much higher. This can be explained by the fact that, here, only two tests were involved, of which the first specifically represents the order of interest and was evaluated with a one-sided p-value, that is, it was evaluated in a relatively powerful, confirmatory way. Finally, the Bayesian approach (E) slightly outperforms approach D. Specifying precisely what one wants to know and evaluating this with a direct approach results in the highest power levels.

Discussion
Attention to limitations of null hypothesis testing in general, e.g., [13]-[15], and problems with power, lack of replication, and multiple testing specifically, e.g., [4] [16], is widespread in both statistical and applied research literature. In the past two decades, a Bayesian approach for the evaluation of informative hypotheses was presented as an alternative, confirmatory approach, e.g., [7] [17] [18]. In these papers it is often claimed that with the formulation and evaluation of informative hypotheses more powerful methods are obtained. In a few papers, some examples are provided to support this claim with numbers, e.g., [19] [20]. However, so far, no systematic study of the power of different (exploratory and confirmatory) approaches was reported and this is, therefore, the main contribution of this paper.
We presented several simulations in the context of evaluating a simple ordering of k means in a one-way design.The results, however, present a more general message and are similar when the k means come from a factorial design and irrespective of which expected pattern of means is evaluated.To illustrate this, one example of a two-way analysis of variance and an informative hypothesis that did impose a different set of constraints on the means was also provided.
Results in this paper show that the approaches that are mostly found in the research literature, that is, analysis of variance omnibus tests with multiple follow-up comparisons of means, have very limited power to detect the true pattern of means. Approaches that are specifically designed for the evaluation of prespecified expectations, like planned contrast testing or the Bayesian approach for informative hypotheses, do much better. Typical differences observed in the simulations for the one-way design were power levels between 0% and 15% for approaches based on multiple testing, whereas the confirmatory approaches reached power levels between 80% and 100%.
Additional simulations were done to investigate what sample sizes would be needed to have reasonable power with the commonly used approaches. The main conclusion from these simulations is that it is practically infeasible to detect the true pattern of means with such approaches if the effect size is small, and that huge group sample sizes are still required to detect medium effects ($N_j$ between 180 and more than 1000) or large effects ($N_j$ between 75 and 670). Again, the two confirmatory approaches fare much better, although the sample sizes needed to detect small effects are still relatively large (between 125 and 315 per group).
The results lead to a couple of recommendations. First of all, if specific expectations are formulated beforehand, as is often the case in psychological experiments like those described in this paper, we strongly recommend considering approaches that use as few tests as possible. Planned contrasts have much more power to detect the true pattern than omnibus ANOVAs with several follow-up tests, and so does the Bayesian approach. Whenever the expectation can be formulated in one contrast (e.g., the linear contrast we used in the one-way design to reflect the simple ordering of means), the differences in power between the contrast testing approach and the Bayesian approach are negligible for the effect sizes and numbers of groups investigated in this paper. A potential advantage of the Bayesian approach is its flexibility in the types of hypotheses that can be formulated and evaluated. Any expected pattern that can be expressed using a combination of smaller than (<), bigger than (>), equal to (=), and/or no mutual constraint (,) can be evaluated using the Bayesian approach. An example where planned contrast testing required two tests, and therefore resulted in less power to detect the true pattern than the Bayesian method, was provided in the context of a factorial design with specific expectations about the interaction effect of the two factors.
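The gap between multiple testing and a single planned contrast can be illustrated with a small Monte Carlo sketch. It is not the simulation design of this paper: it assumes a known unit variance (so z-tests stand in for the t-tests), equally spaced population means, and illustrative values for the number of groups, the step between means, and the group sample size. Full support for the pairwise approach is taken here to mean that every pairwise difference is significant (two-sided) and in the expected direction.

```python
import random
from itertools import combinations
from statistics import NormalDist, fmean

Z = NormalDist()  # standard normal, used in place of the t distribution


def full_support_power(k=3, step=0.5, n=30, alpha=0.05, reps=2000, seed=1):
    """Estimate full-support power for (a) all pairwise tests, (b) one linear contrast.

    Assumptions (not from the paper): sigma = 1 is known, population means
    are 0, step, 2*step, ..., and every group has n observations.
    """
    rng = random.Random(seed)
    crit_two_sided = Z.inv_cdf(1 - alpha / 2)
    crit_one_sided = Z.inv_cdf(1 - alpha)
    # Centered linear contrast weights, e.g. (-1, 0, 1) for k = 3.
    weights = [j - (k - 1) / 2 for j in range(k)]
    se_pair = (2 / n) ** 0.5
    se_contrast = (sum(w * w for w in weights) / n) ** 0.5

    hits_pairwise = hits_contrast = 0
    for _ in range(reps):
        means = [fmean(rng.gauss(j * step, 1.0) for _ in range(n)) for j in range(k)]
        # (a) every pair must differ significantly in the hypothesized direction
        if all((means[b] - means[a]) / se_pair > crit_two_sided
               for a, b in combinations(range(k), 2)):
            hits_pairwise += 1
        # (b) a single one-sided test of the linear contrast
        contrast = sum(w * m for w, m in zip(weights, means))
        if contrast / se_contrast > crit_one_sided:
            hits_contrast += 1
    return hits_pairwise / reps, hits_contrast / reps


power_pairwise, power_contrast = full_support_power()
print(f"all pairwise tests: {power_pairwise:.2f}, linear contrast: {power_contrast:.2f}")
```

Under these assumed settings the single contrast reaches full support far more often than the set of pairwise tests, which is the qualitative pattern described above; the exact proportions depend on the chosen values.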
The results of this paper also show how variable different samples from the same population can be. Although sampling variability is a concept known to all researchers who are familiar with data analysis and NHT, the published literature shows that many researchers underestimate the size and consequences of sampling variability in their own study. Not finding a specific expected difference between two means, for instance, is often explained by substantive arguments. The fact that this could very well be a type 2 error, due to limited power, is hardly ever mentioned, especially when some other interesting comparisons did reach statistical significance. Likewise, a significant difference between certain means that was not expected a priori often receives considerable attention. However, in the context of multiple testing the probability of finding at least one significant difference is large, and it might therefore just as well be a chance finding (inflated type 1 error due to multiple testing). It seems that significant and non-significant results are too often interpreted as rather certain indicators of the true effects. With this paper, we hope to contribute to the awareness that results from a single study can only provide conclusions with very limited certainty and that replication studies are crucial.
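The inflation of the type 1 error under multiple testing can be made concrete with a short calculation. The sketch below assumes that all pairwise tests among k groups are carried out at the same alpha and that the tests are independent. Pairwise tests share data and are therefore not independent, so the result is only an approximation.

```python
from math import comb


def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across all pairwise tests among k groups.

    Assumes independent tests; pairwise tests share data, so this is only
    an approximation.
    """
    n_tests = comb(k, 2)  # number of pairwise comparisons
    return 1 - (1 - alpha) ** n_tests


for k in (3, 4, 6):
    print(k, comb(k, 2), round(familywise_error_rate(k), 3))
```

With three groups and alpha = 0.05 the approximate rate is already about 0.14, and it grows quickly with the number of groups.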
Another recommendation relates to the publication process, where a clearer distinction could be made between confirmatory and exploratory analyses. When specific theories or expectations are specified a priori and confirmatory methods are used to evaluate them, conclusions can be relatively strong, although the need for replication studies will always remain. Other findings, including those from studies in which no specific hypotheses were formulated, should be reported with an acknowledgment of their exploratory nature. One can conclude that interesting findings were seen in this particular data set, but replications with new data, and preferably confirmatory methods, are required before it can be concluded whether they reflect real effects or chance findings.

Conclusion
The need for replication leads to a final recommendation, which is not at all original but is crucial to the accumulation of scientific knowledge. All results of all tests within a study should be properly reported, irrespective of their statistical significance, including the appropriate sample statistics (e.g., means, standard deviations, group sample sizes). This holds not only for non-significant findings within a study but also for studies in which no significant results were found at all and that are currently often hard to get published. Only when these types of publication bias are avoided can replication studies be properly synthesized and results judged for what they are worth. If such a publication culture could be established, we could work towards accumulation of knowledge in a truly scientific way.

Approaches II and III
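The pairwise t-test comparing groups $j$ and $j'$ is based on: $t = \dfrac{M_j - M_{j'}}{\sqrt{MS_{\text{within}} \left( \frac{1}{N_j} + \frac{1}{N_{j'}} \right)}}$, where $MS_{\text{within}}$ is the within-group mean square of the ANOVA, and evaluated using the $T(df_{\text{within}})$ distribution. For approach II the p-value is equal to the two-sided tail probability. For approach III, the p-value is that probability multiplied by the number of pairwise tests, $k(k-1)/2$.
Each planned contrast t-test is based on: [...]
The ANOVA model has the following likelihood: $f(y \mid \mu, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left( y_i - \sum_{j=1}^{k} \mu_j d_{ji} \right)^2}{2\sigma^2} \right)$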

Prior based on training data
The method and software used for the Bayesian analysis are based on Van Wesel, Hoijtink and Klugkist (2011). In that paper, a thorough investigation of different priors that can be used for the analysis of informative hypotheses ($H_i$) in the context of ANOVA is presented. The method is based on the use of an encompassing prior, that is, a (low informative) prior is specified for the unconstrained hypothesis $H_A$, and the prior distributions for the constrained hypotheses are derived by truncation of the prior parameter space. The unconstrained prior is specified using training data (Berger and Pericchi, 1996, 2004; Perez and Berger, 2002). A training sample is a small part of the data that can be used to update the reference prior for the ANOVA model, $1/\sigma^2$ (Bernardo, 1979), such that the resulting posterior is proper but also low informative and objective (i.e., no subjective information is used). In the approaches described in the references above, multiple training samples are used and the results are combined in different ways. Van Wesel et al. (2011) proposed a prior that is based on the same principles but tailored for constrained hypotheses and less computer intensive (i.e., faster). This prior is called the average constrained posterior prior (ACPP). For a detailed explanation and elaborate motivation for this prior we refer to the original paper.
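The general form of the ACPP combines a multivariate normal distribution for the group means with a scaled inverse chi-square distribution for the residual variance; the parameter expressions are given in Van Wesel et al. (2011).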

Posterior model probabilities
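Using a uniform prior on the model space, the posterior model probability of each hypothesis $H_t$ under consideration is its marginal likelihood divided by the sum of the marginal likelihoods of all hypotheses: $PMP(H_t \mid y) = m(y \mid H_t) / \sum_{t'} m(y \mid H_{t'})$.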
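With equal prior model probabilities, posterior model probabilities follow from Bayes factors by normalization, provided that all Bayes factors are computed against the same reference hypothesis. A minimal sketch follows; the hypothesis labels and Bayes factor values are made up for illustration and are not results from this paper.

```python
def posterior_model_probabilities(bayes_factors):
    """Normalize Bayes factors (each against the same reference hypothesis)
    into posterior model probabilities under equal prior model probabilities."""
    total = sum(bayes_factors.values())
    return {name: bf / total for name, bf in bayes_factors.items()}


# Hypothetical Bayes factors of three hypotheses against the unconstrained H_A.
example = {"H1": 4.0, "H2": 1.0, "H3": 0.0}
pmp = posterior_model_probabilities(example)
print(pmp)
```

The resulting probabilities sum to one over the set of hypotheses considered, so they express relative support within that set only.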

For $k = 3$, approaches I-IV require sample sizes ranging from 60 [...]

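Comparing $H_0$ with $H_1$ provides a so-called Bayes factor (BF): $BF_{10}$ expresses how much more support the data provide for $H_1$ than for $H_0$. $BF_{10} = 1$ implies equal support for both hypotheses, whereas values above or below 1 favor $H_1$ or $H_0$, respectively.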

Figure 1. Hypothesized results for the study on stereotypes and gender differences in mathematics [2], with math performance on the y-axis.

Appendix A: Tests Used in Approaches I-V (Section 2) and A-D (Section 3)
Notations used for one-way design: N = total sample size; M = overall sample mean; k = number of groups ($j = 1, \ldots, k$)

[...] distribution and the one-sided tail probability (taking the hypothesized order into account).
Approach V
Denoting the linear contrast weight for $M_j$ with $\lambda_j$, the test is based on: $t = \dfrac{\sum_{j=1}^{k} \lambda_j M_j}{\sqrt{MS_{\text{within}} \sum_{j=1}^{k} \lambda_j^2 / N_j}}$, where $MS_{\text{within}}$ is the within-group mean square, evaluated using the $T(df_{\text{within}})$ distribution and the one-sided tail probability (taking the hypothesized order into account).
Notations used for two-way design: N = total sample size; M = overall sample mean; k = number of levels of the first factor ($j = 1, \ldots, k$); h = number of levels of the second factor ($g = 1, \ldots, h$).
Two t-tests are used for the simple main effects: [...] Note that the first test is evaluated with the one-sided tail probability, whereas the second is not.
Approach C
All pairwise t-tests comparing group $jg$ with $j'g'$ are based on: $t = \dfrac{M_{jg} - M_{j'g'}}{\sqrt{MS_{\text{within}} \left( \frac{1}{N_{jg}} + \frac{1}{N_{j'g'}} \right)}}$, and evaluated using the $T(df_{\text{within}})$ distribution. Note that the first test with [...]
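The linear contrast test of approach V can be sketched in a few lines. The function below uses the standard pooled-variance contrast statistic, which is assumed to match the test described in this appendix; the function name and the data are made up for illustration.

```python
from statistics import fmean


def linear_contrast_t(groups, weights):
    """Return (t, df_within) for the contrast sum_j weights[j] * M_j.

    Uses the pooled within-group variance (MS_within), as in a one-way ANOVA.
    """
    means = [fmean(g) for g in groups]
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_within = sum(len(g) for g in groups) - len(groups)
    ms_within = ss_within / df_within
    estimate = sum(w * m for w, m in zip(weights, means))
    std_error = (ms_within * sum(w * w / len(g) for w, g in zip(weights, groups))) ** 0.5
    return estimate / std_error, df_within


# Made-up data for three groups with an expected increase in means.
data = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
t_value, df = linear_contrast_t(data, weights=[-1, 0, 1])
print(t_value, df)
```

The one-sided p-value is the upper tail probability of the t distribution with `df` degrees of freedom at `t_value`, which can be obtained, for example, with `scipy.stats.t.sf(t_value, df)` if SciPy is available.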

$I$ is an indicator function with value one if the means are in agreement with $H_i$, and zero otherwise. The specification of the unconstrained prior is based on training data (Berger and Pericchi, 1996, 2004; Perez and Berger, 2002).

$N(\cdot)$ denotes the multivariate normal distribution with a mean parameter and covariance matrix, and $\text{Inv-}\chi^2(\cdot)$ the scaled inverse chi-square distribution with the degrees of freedom and a scale parameter.
Posterior
The posterior distribution based on the ACPP is: [...]
Bayes factors
The Bayes factor comparing two hypotheses is the ratio of two marginal likelihoods. A marginal likelihood, for instance $m(y \mid H_A)$, is the density of the data averaged over the prior distribution of $H_A$. Chib (1995) noted that for the estimation of the marginal likelihood it can be useful to use the expression (imputing our choice of prior and subsequent posterior): $m(y \mid H_A) = \dfrac{f(y \mid \mu, \sigma^2)\, h(\mu, \sigma^2 \mid H_A)}{g(\mu, \sigma^2 \mid y, H_A)}$, where $f$ is the likelihood, $h$ the prior, and $g$ the posterior. Klugkist and Hoijtink (2007) derived that in the context of encompassing priors (i.e., the constrained model is nested in the unconstrained), the Bayes factor comparing an informative hypothesis $H_i$ with the unconstrained hypothesis $H_A$ reduces to the ratio of two proportions: the proportion of the unconstrained posterior distribution in agreement with the constraints of $H_i$, and the proportion of the unconstrained prior distribution in agreement with the constraints of $H_i$. These proportions are estimated using Markov chain Monte Carlo (MCMC) sampling methods.
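The ratio-of-proportions form of the Bayes factor lends itself to a simple sampling sketch for an order-constrained hypothesis. The code below is not the ACPP-based procedure of this paper. As simplifying assumptions, the unconstrained posterior of each mean is taken to be normal around its sample mean with a known standard error, and the prior is taken to be exchangeable, so that each of the k! orderings of the means has the same prior proportion. The function name and input values are hypothetical.

```python
import random
from math import factorial


def order_bayes_factor(sample_means, std_errors, draws=20000, seed=1):
    """Estimate the Bayes factor of H_i: mu_1 < ... < mu_k against H_A.

    Simplifying assumptions (not the ACPP of the paper): independent normal
    posteriors with known standard errors, and an exchangeable prior under
    which every ordering of the k means is equally likely.
    """
    rng = random.Random(seed)
    k = len(sample_means)
    in_agreement = 0
    for _ in range(draws):
        mu = [rng.gauss(m, s) for m, s in zip(sample_means, std_errors)]
        if all(mu[j] < mu[j + 1] for j in range(k - 1)):
            in_agreement += 1
    posterior_proportion = in_agreement / draws
    prior_proportion = 1 / factorial(k)  # one of k! equally likely orderings
    return posterior_proportion / prior_proportion


bf_agree = order_bayes_factor([0.0, 0.5, 1.0], [0.2, 0.2, 0.2])
bf_reversed = order_bayes_factor([1.0, 0.5, 0.0], [0.2, 0.2, 0.2])
print(bf_agree, bf_reversed)
```

Because the posterior proportion cannot exceed one, the Bayes factor for a full ordering of k means is bounded above by k! in this setup, which is one reason why support for an order-constrained hypothesis is judged relative to its prior proportion.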

Table 1. Population parameter values used for the simulation studies.

Table 2. Full support power for the six approaches for group sample sizes $N_j$ that provide 0.80 power for the omnibus ANOVA (for several numbers of groups $k$ and effect sizes ES).
Sample sizes $N_j$ that provide 0.80 power for the ANOVA were determined using G*Power 3.1 [11].

Table 3. Approximate required group sample sizes $N_j$ to obtain 0.80 full support power for each of the six approaches.

Table 4. Proportions of different Bayes factors for $k$ = 3, 4, 6 groups, medium effect size, and group sample sizes $N_j$ providing power of 0.80 for the omnibus ANOVA.

Table 5. Comparison of the power of approaches V and VI when population means are increasing non-linearly.

Table 6. Comparison of the power of approaches A-E when population means are in agreement with the hypothesis of interest with increasing differences between the means ($N_j = 50$).
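The ANOVA model is $y_i = \sum_{j=1}^{k} \mu_j d_{ji} + \varepsilon_i$, where $d_{ji}$ denotes group membership ($d_{ji} = 1$ if person $i$ is a member of group $j$, and zero otherwise), and $\mu_j$ is the mean of group $j$. The residuals $\varepsilon_i$ are assumed to be independent and normally distributed with mean zero and variance $\sigma^2$.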