
Understanding any statistical tool requires not only an understanding of the relevant computational procedures but also an awareness of the assumptions on which the procedures are based and of the effects of violating those assumptions. In our earlier articles (Laverty, Miket, & Kelly [1]) and (Laverty & Kelly, [2] [3]) we used Microsoft Excel to simulate both a hidden Markov model and heteroskedastic models, showing different realizations of these models and the performance of techniques for identifying the underlying hidden states from simulated data. The advantage of using Excel is that the simulations are regenerated whenever the spreadsheet is recalculated, allowing the user to observe the performance of a statistical technique under different realizations of the data. In this article we show how to use Excel to generate data from a one-way ANOVA (Analysis of Variance) model and how the statistical methods behave both when the fundamental assumptions of the model hold and when they are violated. The purpose of this article is to provide tools for individuals to gain an intuitive understanding of these violations using this readily available program.

An important aspect of any statistical procedure is the set of assumptions on which it is based. For example, using the t-distribution to calculate a 95% confidence interval for the centre of a population requires that the population being sampled is normally distributed and that the observations in the sample are independent. If these assumptions do not hold, the procedure may no longer perform as intended. Sometimes the effect of an invalid assumption is minimal, sometimes not. If the population is non-normal but has a finite mean and variance (so that the Law of Large Numbers and the Central Limit Theorem apply), then by the Central Limit Theorem the departure from normality will have little effect on the properties of confidence intervals computed assuming normality, provided the sample size is adequately large. The purpose of this paper is to show how to use Excel to simulate data for which one-way Analysis of Variance (ANOVA) is used. The advantage of using Excel is that when you press the recalculate button, under the Formulas menu, the randomly generated data are regenerated, the statistical calculations are recalculated and the relevant graphs are redrawn. This allows the user to observe the variation in these procedures for different realizations of the data. See

In most cases where one-way ANOVA is applicable the normality assumption is appropriate, i.e. the departures of individual observations from their central value are normally distributed. There are, however, many examples where this is not the case and extreme departures are more prevalent than predicted by the normal distribution. This depends on the measurements being collected. For example, with measurements of blood pressure, IQ, or the performance of a political leader, one may expect extreme values to be present. In such cases an appropriate model for the departures from the central value would be the t-distribution (a heavy-tailed distribution). In this article the reader can use the technique provided to explore the effects of sampling from heavy-tailed distributions on ANOVA calculations that assume normality.

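The probability density functions of the standard normal distribution, Student's t-distribution with ν degrees of freedom and the standard Cauchy distribution are given in (1).

$$
\left.
\begin{aligned}
f_{\text{Normal}}(z) &= \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2} \\
f_{t}(t:\nu) &= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1+\frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}} \\
f_{\text{Cauchy}}(x:0,1) &= \frac{1}{\pi\left(1+x^{2}\right)}
\end{aligned}
\right\} \tag{1}
$$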

The Standard Cauchy distribution is equivalent to the t distribution with 1 degree of freedom. A graph of the standard normal distribution, the t-distribution with 5 degrees of freedom, and the Cauchy distribution is in

The Cauchy distribution is an example of a distribution to which the Law of Large Numbers and the Central Limit Theorem do not apply [

The t-distribution with ν degrees of freedom can also be shown to be a mixture of normal distributions with mean 0 and variance W, where the weighting distribution for W is the inverse gamma distribution with α = ν/2 and β = ν/2 (Cook [
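The mixture representation can be checked numerically. The following is a minimal Python sketch, not part of the spreadsheet; the function names, the seed and the choice ν = 10 are illustrative assumptions. It draws W as the reciprocal of a gamma variate with shape ν/2 and rate ν/2, then draws a normal variate with variance W, and compares the sample variance with the theoretical t variance ν/(ν − 2).

```python
import math
import random


def t_variate_by_mixture(nu, rng):
    """Draw one t(nu) variate as a scale mixture of normals (illustrative helper)."""
    # W ~ inverse gamma(alpha = nu/2, beta = nu/2): the reciprocal of a gamma
    # variate with shape nu/2 and rate nu/2, i.e. scale 2/nu.
    # random.gammavariate takes (shape, scale).
    w = 1.0 / rng.gammavariate(nu / 2.0, 2.0 / nu)
    # Conditional on W = w, the observation is Normal(mean 0, variance w).
    return rng.gauss(0.0, math.sqrt(w))


def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(2019)  # fixed seed so the run is repeatable
nu = 10                    # assumed degrees of freedom for the illustration
draws = [t_variate_by_mixture(nu, rng) for _ in range(100_000)]
print("sample variance:     ", round(sample_variance(draws), 3))
print("theoretical variance:", nu / (nu - 2))
```

The two printed values should be close for ν > 4; for small ν the sample variance becomes unstable, which is itself a symptom of the heavy tails.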

Uniform random variates on [0,1] can be generated in Excel with the function “RAND()”. Random variates Y from a continuous distribution with measure of central location μ and measure of scale σ can be generated using the inverse-transform method (Fishman [ as Y = μ + σF^{−1}(U), where F is the cumulative distribution function of the desired standardized distribution and U has a uniform distribution on [0,1] (see

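Comment: The Excel function TINV(U,df) does not calculate F^{−1}(U) for the t-distribution with df degrees of freedom, because TINV returns the two-tailed inverse; the function TINV(2*(1-U),df) does achieve the desired calculation. Note that TINV accepts only probabilities between 0 and 1, so this expression is defined only for U ≥ 0.5; in Excel 2010 and later, the left-tailed function T.INV(U,df) returns F^{−1}(U) for any 0 < U < 1.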
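The inverse-transform method can be illustrated outside Excel. The sketch below is a hedged Python illustration, not the spreadsheet itself; the helper names are hypothetical. It uses the standard library's normal inverse CDF and the closed-form inverse CDF of the standard Cauchy distribution (the t-distribution with 1 degree of freedom), F^{−1}(u) = tan(π(u − 1/2)), and applies the location and scale parameters afterwards.

```python
import math
import random
from statistics import NormalDist


def inv_cdf_normal(u):
    """Inverse CDF of the standard normal (the role NORMSINV plays in Excel)."""
    return NormalDist().inv_cdf(u)


def inv_cdf_cauchy(u):
    """Inverse CDF of the standard Cauchy, i.e. the t-distribution with 1 d.f."""
    return math.tan(math.pi * (u - 0.5))


def location_scale_variate(mu, sigma, inv_cdf, u):
    """Map a uniform value u in (0, 1) to location mu and scale sigma."""
    return mu + sigma * inv_cdf(u)


rng = random.Random(1)  # fixed seed for repeatability
u = rng.random()        # plays the role of RAND(); assumed to fall strictly inside (0, 1)
print("normal draw:", round(location_scale_variate(10, 3, inv_cdf_normal, u), 2))
print("Cauchy draw:", round(location_scale_variate(10, 3, inv_cdf_cauchy, u), 2))
```

Feeding the same uniform value to both inverse CDFs shows how the heavy-tailed distribution stretches values of u near 0 or 1 far more than the normal does.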

The simulated data will come from 3 populations (this can easily be generalized to more than 3 populations). The parameters of the populations are:

1) mean (central location), stored in cells C2:E2

2) standard deviation (scale parameter), stored in cells C3:E3

3) sample size, stored in cells C4:E4

4) a parameter that determines whether the data are normal or non-normal, stored in cells C1:E1. This parameter is set to zero if the desired data are normal. If it is set to an integer ν greater than 0, the data will come from a t-distribution with ν degrees of freedom. The t-distribution is non-normal and heavy-tailed, centred and symmetric about zero.

5) A final parameter (precision), located in cell A2, specifies the number of decimal places to which the raw data are rounded (

A | B | C | D | E |
---|---|---|---|---|
precision | normality | 0 | 0 | 0 |
2 | loc. par. | 10 | 10 | 15 |
 | scale par. | 3 | 3 | 3 |
 | n | 10 | 10 | 10 |

Enter the observation numbers (1 to 10) in cells B7:B16.

Paste in cell C7 the formula “=IF($B7>C$4,"",ROUND(C$2+C$3*IF(C$1=0,NORMSINV(RAND()),TINV(2*(1-RAND()),C$1)),$A$2))”

Copy this formula to cells C7:E16. If the normality parameter is 0, the data generated will be from the normal distribution with mean = “loc. par.” and standard deviation = “scale par.”. If the normality parameter is an integer greater than 0, the data will be a t-distributed random number scaled by “scale par.” and shifted in location by “loc. par.”. Cells whose observation number exceeds the sample size n are left blank. The data will be rounded to the number of decimals specified by “precision”.

For each population compute T_{i} = Σx and Σx^{2}. Paste the formulae “=SUM(C7:C16)” and “=SUMSQ(C7:C16)” in cells C18 and C19, respectively. Copy these formulae to cells C18:E19.

Suppose we have data from k normal populations with means $\mu_1, \mu_2, \mu_3, \cdots, \mu_k$ and common standard deviation σ. Let $\{x_{ij},\ i = 1, 2, \cdots, k;\ j = 1, 2, \cdots, n_i\}$ denote data from these populations, where $x_{ij}$ is the j^{th} observation from the i^{th} population and $n_i$ is the sample size from the i^{th} population.

Let

$$
\bar{x}_i = \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i} \quad\text{and}\quad s_i = \sqrt{\frac{\sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2}{n_i - 1}} \tag{2}
$$

denote the sample mean and standard deviation from the i^{th} population. To compute the sample mean and sample standard deviation for each population, paste the formulae “=AVERAGE(C7:C16)” and “=STDEV(C7:C16)” in cells C21 and C22. Copy these formulae to cells C21:E22.

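To test the null hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ against $H_A: \mu_i \ne \mu_j$ for at least one pair i, j we use the test statistic

$$
F = \frac{\sum_{i=1}^{k} n_i \left(\bar{x}_i - \bar{x}_{\cdot}\right)^2 / (k-1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2 / (N-k)} = \frac{SS_{\text{Between}} / (k-1)}{SS_{\text{Within}} / (N-k)} \tag{3}
$$

where

$$
SS_{\text{Between}} = \sum_{i=1}^{k} n_i \left(\bar{x}_i - \bar{x}_{\cdot}\right)^2 \quad\text{and}\quad SS_{\text{Within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2 \tag{4}
$$

Here $\bar{x}_{\cdot}$ is the mean of all observations and $N = \sum_i n_i$ is the total sample size. When H_{0} is true and the model assumptions hold, this statistic has an F-distribution with ν_{1} = k − 1 degrees of freedom in the numerator and ν_{2} = N − k degrees of freedom in the denominator.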

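The computing formulae for these sums of squares are

$$
SS_{\text{Between}} = \sum_{i} \frac{T_i^2}{n_i} - \frac{G^2}{N} \quad\text{and}\quad SS_{\text{Within}} = \sum_{i} \sum_{j} x_{ij}^2 - \sum_{i} \frac{T_i^2}{n_i} \tag{5}
$$

where

$$
T_i = \sum_{j} x_{ij} \quad\text{and}\quad G = \sum_{i} T_i = \sum_{i} \sum_{j} x_{ij} \tag{6}
$$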

The testing for One-way ANOVA is carried out using the Analysis of Variance table (

Place the formula “=SUM(C18:E18)” in cell G18 to compute the grand total, $G = \sum_i T_i = \sum_i \sum_j x_{ij}$, and the formula “=SUM(C19:E19)” in cell G19 to compute $\sum_i \sum_j x_{ij}^2$.

Place the formula “=C18^2/C4” in cell C24 and copy it to cells D24:E24 to compute $T_i^2 / n_i$ for each sample. Then place the formula “=SUM(C24:E24)” in cell G24 to compute $\sum_i T_i^2 / n_i$.

To compute $SS_{\text{Between}} = \sum_i T_i^2 / n_i - G^2 / N$, place the formula “=G24-G18^2/F4” in cell J22 (this formula refers to cell F4, which is therefore assumed to hold the total sample size N, e.g. “=SUM(C4:E4)”), and to compute $SS_{\text{Within}} = \sum_i \sum_j x_{ij}^2 - \sum_i T_i^2 / n_i$, place the formula “=G19-G24” in cell J23.

Source | d.f. | Sum of Squares | Mean Square | F | Significance |
---|---|---|---|---|---|
Between | k − 1 | SS_{Between} | MS_{Between} | MS_{Between}/MS_{Within} | p-value |
Within | N − k | SS_{Within} | MS_{Within} | | |
Total | N − 1 | SS_{Total} | MS_{Total} | | |

The formulae for the degrees of freedom can be placed in cells K22:K23 and the formulae for the mean squares (sum of squares divided by degrees of freedom) in cells L22:L23.

The formula for the F statistic “=L22/L23” can be placed in cell M22. The formula for the p-value of the observed F value “=FDIST(M22, K22,K23)” can be placed in cell N22.
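The spreadsheet calculations can be cross-checked with a short script. The following Python sketch is an illustration under assumed names (`one_way_anova` and the toy data are not from the spreadsheet). It computes the sums of squares both from the computing formulae (5) and from the definitions (4), and then forms the F statistic; the p-value is omitted because the Python standard library has no F distribution.

```python
def one_way_anova(groups):
    """Sums of squares and F statistic for a list of samples (illustrative helper)."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    N = sum(sizes)
    totals = [sum(g) for g in groups]                  # T_i, as in row 18
    G = sum(totals)                                    # grand total, as in G18
    sum_sq = sum(x * x for g in groups for x in g)     # sum of squares, as in G19
    sum_t2_n = sum(t * t / n for t, n in zip(totals, sizes))  # as in G24

    # Computing formulae (5).
    ss_between = sum_t2_n - G * G / N
    ss_within = sum_sq - sum_t2_n

    # Definitional formulae (4), used here only as a check.
    grand_mean = G / N
    means = [t / n for t, n in zip(totals, sizes)]
    ss_between_def = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within_def = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    f = (ss_between / (k - 1)) / (ss_within / (N - k))
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "ss_between_def": ss_between_def, "ss_within_def": ss_within_def,
        "df_between": k - 1, "df_within": N - k, "F": f,
    }


# Toy data, chosen only so that the arithmetic is easy to follow by hand.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in result.items():
    print(name, value)
```

For the toy data the group totals are 6, 9 and 18, so SS_{Between} = 147 − 121 = 26, SS_{Within} = 153 − 147 = 6 and F = (26/2)/(6/6) = 13, and both routes to the sums of squares agree.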

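The formula for a (1−α)100% confidence interval for the mean of the i^{th} population is:

$$
\bar{x}_i \pm t_{\alpha/2}\left(df_{\text{Error}}\right) \sqrt{\frac{MS_{\text{Error}}}{n_i}} \tag{7}
$$

where MS_{Error} and df_{Error} are the within-sample mean square and degrees of freedom (cells L23 and K23). For a 95% confidence interval, the formula “=C$21-TINV(0.05,$K$23)*SQRT($L$23/C$4)” can be placed in cell I28 for the lower limit and the formula “=C$21+TINV(0.05,$K$23)*SQRT($L$23/C$4)” in cell I29 for the upper limit. These formulae can be copied to cells I28:K29 to do the computation for all samples.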

The spreadsheet should now look like

To construct box-and-whisker plots of the data:

1) Select the range containing the data, C6:E16, for 10 observations from each sample from the 3 populations.

2) The menu item for Box-plots can be found under the histogram item (

Comment: There is a problem with Excel’s method of drawing box-plots. If in the data range there is a blank cell, when drawing a box-plot Excel treats that cell as containing a zero rather than treating the observation as non-existent.

In these exercises we generate samples under different ANOVA assumptions to examine the effects of violations of these assumptions on the ANOVA calculations.

1) Equal means, Equal Standard deviations, Equal sample size, Normality:

μ_{1} = 10, σ_{1} = 2, n_{1} = 10, μ_{2} = 10, σ_{2} = 2, n_{2} = 10, μ_{3} = 10, σ_{3} = 2, n_{3} = 10; normality = 0 (normal distribution)

Comment: When the population means are all equal and the assumptions are satisfied the p-values come from a uniform distribution from 0 to 1. Thus 5% of the time the p-value will be less than or equal to 0.05 resulting in a type I error.
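The 5% rejection rate can be checked by repeating the recalculation many times in a script. The Python sketch below is an illustration, not the authors' spreadsheet; the function names and seed are assumptions. It relies on one fact specific to three populations: when the numerator has 2 degrees of freedom, the upper tail of the F-distribution has the closed form P(F > f) = (1 + 2f/ν_{2})^{−ν_{2}/2}, so no statistical library is needed. The sketch does not apply to other numbers of groups.

```python
import random


def f_statistic(groups):
    """Return (F, df_within) using the computing formulae for the sums of squares."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    N = sum(sizes)
    totals = [sum(g) for g in groups]
    sum_t2_n = sum(t * t / n for t, n in zip(totals, sizes))
    ss_between = sum_t2_n - sum(totals) ** 2 / N
    ss_within = sum(x * x for g in groups for x in g) - sum_t2_n
    return (ss_between / (k - 1)) / (ss_within / (N - k)), N - k


def p_value_two_numerator_df(f, df_within):
    """Upper-tail F probability; valid ONLY when the numerator d.f. equals 2."""
    return (1.0 + 2.0 * f / df_within) ** (-df_within / 2.0)


def rejection_rate(means, sds, n, reps, rng, alpha=0.05):
    """Fraction of simulated normal data sets with p-value <= alpha (3 groups only)."""
    if len(means) != 3 or len(sds) != 3:
        raise ValueError("the closed-form p-value used here requires exactly 3 groups")
    rejections = 0
    for _ in range(reps):
        groups = [[rng.gauss(m, s) for _ in range(n)] for m, s in zip(means, sds)]
        f, df_within = f_statistic(groups)
        if p_value_two_numerator_df(f, df_within) <= alpha:
            rejections += 1
    return rejections / reps


rng = random.Random(2019)  # fixed seed for repeatability
# Exercise 1: equal means, equal standard deviations, normal data.
print("exercise 1 rejection rate:", rejection_rate([10, 10, 10], [2, 2, 2], 10, 4000, rng))
```

The printed rate should be close to 0.05. Changing the means or the standard deviations in the call gives a rough picture of the other normal-data exercises under the same 3-group restriction.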

2) Unequal means (H_{0} false), Equal Standard deviations, Equal sample size, Normality:

μ_{1} = 15, σ_{1} = 2, n_{1} = 10, μ_{2} = 10, σ_{2} = 2, n_{2} = 10, μ_{3} = 5, σ_{3} = 2, n_{3} = 10; normality = 0 (normal distribution)

Comment: The ability to detect differences among the means will depend on the non-centrality parameter $\delta = \sum_i n_i (\mu_i - \mu)^2 / \sigma^2$, where $\mu = \sum_i n_i \mu_i / \sum_i n_i$.

(Kirk, [

3) Unequal means (H_{0} false), Equal Standard deviations, Equal sample size, Normality (low non-centrality parameter):

μ_{1} = 11, σ_{1} = 5, n_{1} = 10, μ_{2} = 10, σ_{2} = 5, n_{2} = 10, μ_{3} = 9, σ_{3} = 5, n_{3} = 10; normality = 0 (normal distribution)

Comment: In this case the non-centrality parameter is smaller than in the previous example. The p-value of the F-test is typically considerably higher, so the test often fails to detect a difference in the means.
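To make the comparison between exercises 2 and 3 concrete, the non-centrality parameter can be evaluated directly from the definition given in the comment to exercise 2. This is a small Python sketch with a hypothetical helper name.

```python
def noncentrality(means, sigma, sizes):
    """delta = sum_i n_i (mu_i - mu)^2 / sigma^2, with mu the n-weighted mean."""
    total_n = sum(sizes)
    mu = sum(n * m for n, m in zip(sizes, means)) / total_n
    return sum(n * (m - mu) ** 2 for n, m in zip(sizes, means)) / sigma ** 2


# Exercise 2: means 15, 10, 5 with sigma = 2 and n = 10 per sample.
print("exercise 2 delta:", noncentrality([15, 10, 5], 2, [10, 10, 10]))
# Exercise 3: means 11, 10, 9 with sigma = 5 and n = 10 per sample.
print("exercise 3 delta:", noncentrality([11, 10, 9], 5, [10, 10, 10]))
```

By hand, exercise 2 gives δ = 10(25 + 0 + 25)/4 = 125 and exercise 3 gives δ = 10(1 + 0 + 1)/25 = 0.8, which is why differences are detected almost every time in the former and rarely in the latter.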

4) Equal means, Unequal Standard deviations, Equal sample size, Normality:

μ_{1} = 10, σ_{1} = 2, n_{1} = 10, μ_{2} = 10, σ_{2} = 5, n_{2} = 10, μ_{3} = 10, σ_{3} = 10, n_{3} = 10; normality = 0 (normal distribution)

Comment: The ANOVA F-test is to some extent robust against violation of the assumption of homogeneity of variance (Bathke [

5) Equal means, Equal Standard deviations, Equal sample size, non-Normality:

μ_{1} = 10, σ_{1} = 2, n_{1} = 10, μ_{2} = 10, σ_{2} = 2, n_{2} = 10, μ_{3} = 10, σ_{3} = 2, n_{3} = 10; normality = 1 (Cauchy distribution)

Comment: Recall that when the data come from the Cauchy distribution (t-distribution with 1 d.f.) neither the Law of Large Numbers nor the Central Limit Theorem is applicable. In fact, the distribution of the sample mean of n observations is the same as that of a single observation. This is illustrated in this example.
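The claim about the Cauchy sample mean can also be checked by simulation. The Python sketch below is an illustration with assumed names and seed. For a Cauchy distribution with location 10 and scale 2, a single observation falls within one scale unit of the location with probability 0.5, and if the sample mean has the same distribution, the same must hold for means of 10 observations.

```python
import math
import random


def cauchy_variate(loc, scale, rng):
    """Inverse-transform draw from a Cauchy distribution with given location and scale."""
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))


def fraction_of_means_within_scale(n, reps, rng, loc=10.0, scale=2.0):
    """Fraction of sample means (of n draws) lying within one scale unit of loc."""
    hits = 0
    for _ in range(reps):
        mean = sum(cauchy_variate(loc, scale, rng) for _ in range(n)) / n
        if abs(mean - loc) <= scale:
            hits += 1
    return hits / reps


rng = random.Random(2019)  # fixed seed for repeatability
print("n = 1: ", fraction_of_means_within_scale(1, 20_000, rng))
print("n = 10:", fraction_of_means_within_scale(10, 20_000, rng))
```

Both fractions should be near 0.5. With normal data the second fraction would be much larger, because averaging concentrates the sample mean around the population mean.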

In applying any statistical procedure it is important to understand the assumptions on which it is based. It is also important to understand the effects that violations of these assumptions have on the procedure. Sometimes the effects of the violations can be extreme, sometimes minimal. The purpose of this article is to provide tools for individuals to gain an intuitive understanding of these violations using the readily available program Microsoft Excel. The advantage of using Excel is that when you press the recalculate button, under the Formulas menu, the randomly generated data are regenerated, the statistical calculations are recalculated and the relevant graphs are redrawn. The statistical procedure that we have chosen to illustrate these tools is one-way ANOVA. This procedure is an important component of introductory statistical courses and textbooks. The tools can be easily extended to other and more advanced univariate procedures.

Excel is a very useful tool for examining the performance of one-way analysis of variance both when the assumptions hold and, more importantly, when the assumptions are violated.

The authors declare no conflicts of interest regarding the publication of this paper.

Laverty, W. and Kelly, I. (2019) Using Excel to Explore the Effects of Assumption Violations on One-Way Analysis of Variance (ANOVA) Statistical Procedures. Open Journal of Statistics, 9, 458-469. https://doi.org/10.4236/ojs.2019.94031