Using Excel to Explore the Effects of Assumption Violations on One-Way Analysis of Variance (ANOVA) Statistical Procedures

To understand any statistical tool requires not only an understanding of 
the relevant computational procedures but also an awareness of the assumptions 
upon which the procedures are based, and the effects of violations of these 
assumptions. In our earlier articles 
(Laverty, Miket, & Kelly [1]) and (Laverty & Kelly, [2] [3]) we used Microsoft Excel to simulate both a Hidden Markov model and 
heteroskedastic models showing different realizations of these models and the 
performance of the techniques for identifying the underlying hidden states 
using simulated data. The advantage of using Excel is that the simulations are 
regenerated when the spreadsheet is recalculated allowing the user to observe 
the performance of the statistical technique under different realizations of 
the data. In this article we will show how to use Excel to generate data from a 
one-way ANOVA (Analysis of Variance) model and how the statistical methods 
behave both when the fundamental assumptions of the model hold and when these 
assumptions are violated. The purpose 
of this article is to provide tools for individuals to gain an intuitive 
understanding of these violations using this readily available program.

quires that the population being sampled is a normal distribution and that the observations in the sample are independent. If these underlying assumptions do not hold, the desired performance of the statistical procedure may no longer hold true. Sometimes the effect of an invalid assumption on a property of the procedure is minimal, sometimes not so. If the population is non-normal but has a finite mean and variance (such that the Law of Large Numbers and the Central Limit theorem applies), the departure from normality will have little effect on the properties of confidence intervals computed assuming normality when the sample size is adequately large. The reason for this is that it is a consequence of the Central Limit Theorem. The purpose of this paper is to show how to use the program Excel to simulate data for which the statistical technique of one-way Analysis of Variance (ANOVA) is used. The advantage of the using the program Excel is that when you press the recalculate button, under the Formulas menu, the data that is generated at random will be regenerated, statistical calculations will be recalculated and relevant graphs will be redrawn. This allows the user to observe the variation in these procedures for different realizations of the data. See Figure 1.

A Model for Non-Normality (The Cauchy Distribution, the t-Distribution)
For most cases when one-way ANOVA is applicable the normality assumption is   The Standard Cauchy distribution is equivalent to the t distribution with 1 degree of freedom. A graph of the standard normal distribution, the t-distribution with 5 degrees of freedom, and the Cauchy distribution is in Figure 2.
The Cauchy Distribution is an example of a distribution where the Law of Large numbers and the Central limit Theorem do not apply [4]. In order for these two Laws to hold both the mean and higher moments have to exist and be finite. This is not the case for the Cauchy distribution. There is no convergence of the distribution of the sample mean to the central value. In fact the distribution of the sample mean is the Cauchy distribution for any sample size (i.e. the distribution of the sample mean is the same as that of any individual observation when the data comes from the Cauchy distribution). The Cauchy distribution is a heavy-tailed distribution. The t-distribution is also a heavy-tailed distribution (but not as extreme) when the degrees of freedom ν is small. As the degrees of freedom increases the t distribution approaches the standard normal distribution. Tsay [5] uses the t-distribution with 5 degrees to model random disturbances that appear in various time series models of financial data. This accounts for the sometimes extreme changes that appear in financial data. The Cauchy distribution is appropriate if extreme values are prevalent in the data (the t-distribution with degrees of freedom higher than 1 in the less extreme case).
This could occur in surveys where individuals were asked to make a continuous measurement of some quantity and extreme values were prevalent in the populations. For example, measurements of blood pressure, IQ, and performance of a political leader, could result in non-normal data with extreme values at either end. In such cases alternatives to ANOVA are appropriate. 1 We haven't considered these alternatives in this paper.
The t-distribution with ν degrees of freedom can also be shown to be mixture of Normal distributions with mean 0 and variance W, where the weighting distribution for W is the inverse gamma distribution with α = ν/2 and β = ν/2 (Cook [6]). This implies that a random variable T will have the t-distribution with ν degrees of freedom if W is selected from the inverse gamma distribution with α = ν/2 and β = ν/2 and then T is selected from Normal distributions with mean 0 and variance W.

Simulation of Data from a Continuous Distribution in Excel
Uniform random variates on [0, 1] can be generated in Excel with the function "RAND()". The generation of random variates from a continuous distribution with measure of central location μ and measure of scale σ, can be carried out using the inverse-transform method (Fishman [7]). Namely Y = F −1 (U) where F(u) is the desired cumulative distribution of Y and U has a uniform distribution on [0, 1] (see Figure 3). In Excel this is achieved for the Normal distribution (mean μ, standard deviation σ) with the function "μ + σ* NORMSINV(RAND())" and for the Cauchy (t with 1 d.f.) location parameter, μ, and scale parameter, σ, "μ + σ* TINV(2*(1-RAND()),1)" (Figure 3). Comment: The Excel function TINV(U,df) does not calculate F −1 (U) for the t-distn with degrees of freedom df, however the excel function TINV(2*(1-U),df) does achieve the desired calculation.

Setting Up the Excel Worksheet to Simulate Anova Data
The data simulated will come from 3 populations (this can easily be generalized to more than 3 populations). The parameters of the populations 1) mean (central location), stored in cells C2:E2 2) standard deviation (scale parameter), stored in cells C3:E3 3) sample size), stored in cells C4:E4 4) a parameter that determines normality of the data versus non-normality. stored in cells C1:E1. This parameter is set to zero if the desired data is normal. If this parameter is set to an integer, ν, greater than 0 the data will come from a t -distribution with ν degrees of freedom. The t -distribution is a non-normal heavy-tailed, centered and symmetric about zero. 5) A final parameter (precision), located in cell A2 specifies the of decimal places that the raw data is rounded to ( Table 1

Generating Simulated Data
Copy the observation numbers (1 to 10) in Cells B7:B:16 Paste in cell C7 the formula =IF($B7>C$4,"",ROUND(C$2+C$3*IF(C$1=0,NORMSINV(RAND()),TINV (2*(1-RAND()),C$1)),$A$2))" Copy this formula to cells C7:E16. If the normality parameter is 0, the data generated will be from the normal distribution with mean = "loc. Par." And standard deviation = "scale par."). If the normality parameter is an integer greater than 0, the data will be a random number with a t-distribution scaled by the "scale par." and location shifted by the "loc. par." The data will be rounded to the number of decimals specified by "precision".

Computation of Statistics Required for One-Way ANOVA
Suppose we have data from k Normal populations with means where ( ) This statistic has an F-distribution with ν 1 = k -1 degrees of freedom in the numerator and ν 2 = N -k degrees of freedom in the denominator.
The computing formulae for 2 2 The testing for One-way ANOVA is carried out using the Analysis of Variance table (Table 2). Place the formula "=SUM(C18:E18)" in cell G18 to compute the grand total,  can be placed in cell N22.
The formula for a (1 − α)100% confidence interval for the mean of the ith sample is: To construct Box-whisker plots of the data 1) Select a range containing the data C6:E16 for 10 observations from each sample from the 3 Populations.
2) The menu item for Box-plots can be found under the histogram item ( Figure 5).  Comment: There is a problem with Excel's method of drawing box-plots.
If in the data range there is a blank cell, when drawing a box-plot Excel treats that cell as containing a zero rather than treating the observation as non-existent.

Exercises That Can Be Performed to Illustrate the Effects of Assumption Violations on ANOVA
In these exercises we generate samples using different ANOVA assumptions to examine the violations of these assumptions on the ANOVA calculations.

Discussion
In applying any statistical procedure it is important understanding the assumptions on which it is based. It is also important to understand the effects on these procedures of the violations of these assumptions. Sometimes the effects of the violations can be extreme, sometimes minimal. The purpose of this article is to provide tools for individuals to gain an intuitive understanding of these violations using the readily available program Microsoft Excel. The advantage of the using the program Excel is that when you press the recalculate button, under the Formulas menu, the data that is generated at random will be regenerated, statistical calculations will be recalculated and relevant graphs will be redrawn. The statistical procedure that we have chosen to illustrate these tools is one-way ANOVA. This procedure is an important component of introductory statistical courses and textbooks. The tools can be easily extended to other and more advanced univariate procedures.

Conclusion
Excel is a very useful tool for examining the performance of One-Way Anova of variance both when the assumptions hold and more importantly when the assumptions are violated.