The Analysis and Reporting of the Dundee Ready Education Environment Measure (DREEM): Some Informed Guidelines for Evaluators

Background: There is a need to evaluate perceptions of the educational environment of training institutions for health professionals as part of any assessment of quality standards for education. The Dundee Ready Education Environment Measure (DREEM) is a widely used tool for evaluating the educational environment of medical and other health schools. However, the methods of analysis reported in the published DREEM literature are inconsistent, which could lead to misinterpretation of areas for change and makes comparison between institutions difficult. Those involved in course evaluation are usually not statisticians, and there are no guidelines on DREEM’s reporting or statistical analysis. This paper aims to clarify the choice of methods for the analysis of the DREEM. Method: The statistical literature, typical properties of DREEM data and the results from a series of statistical simulations were used to inform our recommendations. Results: We provide a set of guidelines for the analysis and reporting of the DREEM. In particular, we provide evidence that when comparing independent samples of Likert response data similar to that generated by the DREEM, the non-parametric Wilcoxon Mann Whitney test performs well. Further, one should be wary of using non-parametric methods on matched samples of such data, as they may be overly ready to reject the null hypothesis. Conclusions: Our recommendations have the potential to improve the accuracy and consistency with which inadequacies in the medical school environment are identified and the success of any changes is assessed. They should also facilitate comparison between different institutions using the DREEM.


Introduction
The educational environment of a medical school is both a "manifestation of the curriculum" and a "determinant… of the behaviour of the medical school's students and teachers" (Genn, 2001a: p 342). Genn (2001b) argues that perceptions of the educational environment (the "climate") influence student satisfaction, and student achievement and success. Given its importance and the fact that the educational environment can be changed, it is imperative to measure it; and in so doing, to diagnose strengths and weaknesses that can be remediated to ensure a high quality learning experience for students.
The Dundee Ready Education Environment Measure (DREEM) was designed to measure the educational environment specifically for medical schools and schools for other health professions (Roff et al., 1997). A recent review of the literature to identify and assess instruments designed to measure the educational environment of different health professional training settings concluded that the DREEM was the most suitable instrument for the undergraduate medical education setting (Soemantri et al., 2010). The DREEM comprises 50 items, each with a five-point Likert response ("Strongly Agree" (4), "Agree" (3), "Unsure" (2), "Disagree" (1) and "Strongly Disagree" (0)). The items can be examined individually, or combined into five subscales or a total DREEM score. Although the authors of the DREEM give guidelines for its interpretation, they do not advise on appropriate methods of statistical inference (McAleer & Roff, 2001). An extensive review of the literature published since the DREEM was introduced in Roff et al.'s 1997 publication showed that the DREEM has been used worldwide in a variety of settings, indicating that many health professional training institutions regard it as a valued and useful tool; however, the methods of analysis and reporting are far from consistent (Miles et al., 2012).
Our aim was to provide a set of recommendations for the analysis and reporting of the DREEM. This would enable the DREEM to be used more easily by evaluators, to more accurately identify problem areas and to facilitate comparison between institutions. However, there is controversy about how Likert data should be analysed that must be taken into account when considering how best to analyse DREEM data.
First, there is debate about the validity of taking a Likert response and treating it as numerical (see, for example, Carifio, 2007). However, the authors of the DREEM intended the item scores to be used and combined as numbers so this question can be put aside for the DREEM. Second, there is controversy as to whether it is reasonable to treat Likert response scores as continuous numerical data, also known as interval data, which opens up the possibility of using parametric methods. Jamieson (2004) provoked considerable discussion by arguing that as Likert scales are ordinal they should never be analysed using parametric methods, because parametric methods make assumptions such as the normality of the data. However, Carifio (2007, 2008) makes the important distinction between a single Likert item and a Likert scale, that is a collection of Likert items, and supports the case that it is reasonable to treat a combination of eight or more items as interval data; which would apply in the case of the whole 50 item DREEM or its multi-item subscales. Third, Carifio (2008) also argues that single items of a measurement scale should rarely be analysed alone because they form part of a "structured and reasoned whole". However, the authors of DREEM call it a "diagnostic tool" and the developers intended each item of the DREEM to be used individually to diagnose problems in that area. As such, we argue that it is valid to consider each item individually, as well as looking at the five subscales and the full DREEM instrument.
This led us to our own investigations, using a series of simulations to assess the performance of candidate statistical tests for the Likert data generated by the DREEM. Our aim was that these simulations would inform a set of recommendations for the analysis and reporting of the DREEM for current and future users of the DREEM. The investigations also have wider repercussions, in that they are applicable to Likert responses in general.

Methodology
Information from the articles reviewed by Miles et al. (2012) and unpublished student evaluation data from the Norwich Medical School, University of East Anglia (UEA) were used to identify typical distributions for the item responses. We then ran a series of simulations in Stata v8 to assess the performance, for data of this kind, of alternative tests suggested by the statistical literature. Three sample sizes were used: 30, to reflect the conventional threshold at which a parametric test is applied to non-normal samples, and 50 and 130, to represent a subgroup of a year group and a whole year group of students respectively.

The Distribution of Individual DREEM Responses
Data from UEA and research publications suggest that a common distribution of responses for a single DREEM item is 50%-70% Agreeing and 40%-20% Strongly Agreeing, with the remaining small percentage spread between Strongly Disagree, Disagree and Unsure, resulting in a skewed distribution. Further, as Till (2004) points out, a great number of items have bimodal distributions, that is, a high percentage disagree and a high percentage agree giving "mixed messages". Another common occurrence is to observe a very high percentage of Unsure answers, with smaller percentages agreeing or disagreeing. Any method of reporting and analysis must therefore be suitable for all these types of distribution.
The Uses of the DREEM
Miles et al. (2012) identified three main uses of the DREEM for evaluation purposes. First, it is used as a diagnostic tool; that is to highlight elements of a course/curriculum which are currently unsatisfactory and need remediation. Second, it can be used to compare two or more completely separate groups of students, for instance, males with females or one year group with another. More generally this is known as the independent samples case. Third, it is used to compare the same group of students on different occasions; the matched case. This might be, for instance, to compare a cohort's experiences from one academic year to another or alternatively to compare a group of students' scores with their "ideal" or "expected" score. We will consider each of these in turn.

The DREEM as a Diagnostic Tool
Considerations
The developers suggest reporting mean scores across all participants for each of the 50 items separately. If using the DREEM for purely diagnostic purposes, examination of these means will indicate areas of strength and weakness. Individual items with a mean score of ≥3.5 are particularly strong areas, items with a mean score of ≤2.0 need particular attention, and items with mean scores between 2 and 3 are areas of the educational environment that could be improved (McAleer and Roff, 2001).
Recommendations
It is certainly more meaningful to use means than medians because the median can only take one of the five possible scores. However, for skewed or bimodal distributions, which commonly occur in the DREEM, an item with an acceptable central measure may still mask a high proportion of negative responses, so the mean alone does not seem adequate. We therefore suggest reporting a table of results which summarises the responses by merging the Agree/Strongly Agree categories and the Disagree/Strongly Disagree categories, and which also reports the mean. Further, we propose using a series of warnings or "flags", with thresholds decided a priori, to alert to items with a low percentage agreement, a high percentage unsure and/or a high percentage disagree, as well as means below a particular level, say 2.0 as recommended by the developers or 2.5 if one wants to be stricter. Given that many items give skewed responses the standard deviation can mislead, so we do not recommend its inclusion.
An example for one of the DREEM's five subscales using data from Year 1 UEA medical students can be seen in Table 1. We have flagged in bold those items where less than 50% of students Agree/Strongly Agree, more than 30% are Unsure or more than 20% Disagree/Strongly Disagree. Notice that flags occur on the items "Last year's work has been a good preparation for this year's work" and "I am able to memorize all I need". Whilst the item "Last year's work has been a good preparation for this year's work" has a low but acceptable mean of 2.5, the "flag" system draws attention to the fact that less than 50% of respondents agree and nearly all the others are unsure, suggesting that this is an item that needs attention from the teaching team. However, in this case we would not necessarily expect first year students to feel that the work they had done last year (for instance A levels, an Access to Medicine course or employment) was a good preparation for their first year of medical school and there is no cause for concern. This illustrates the importance of interpreting the DREEM scores according to their unique situational context at each educational institution. In contrast, the flag for the item "I am able to memorize all I need" suggests that there may be a concern about workload or learning strategies that the teaching team might need to look into.
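The flag system described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce Table 1: the function name `summarise_item`, the flag labels and the example responses are all hypothetical, and the default thresholds are simply those quoted for Table 1 (less than 50% agreement, more than 30% unsure, more than 20% disagreement, mean below 2.5).

```python
from collections import Counter

# Scoring as given in the text: Strongly Agree=4, Agree=3, Unsure=2,
# Disagree=1, Strongly Disagree=0.

def summarise_item(scores, min_agree=50.0, max_unsure=30.0,
                   max_disagree=20.0, min_mean=2.5):
    """Summarise one DREEM item and list any flags it triggers.

    The thresholds default to those quoted for Table 1; the function
    name and flag labels are hypothetical choices for this sketch.
    """
    n = len(scores)
    counts = Counter(scores)
    pct_agree = 100 * (counts[3] + counts[4]) / n      # Agree + Strongly Agree
    pct_unsure = 100 * counts[2] / n
    pct_disagree = 100 * (counts[0] + counts[1]) / n   # Disagree + Strongly Disagree
    mean = sum(scores) / n

    flags = []
    if pct_agree < min_agree:
        flags.append("low agreement")
    if pct_unsure > max_unsure:
        flags.append("high unsure")
    if pct_disagree > max_disagree:
        flags.append("high disagreement")
    if mean < min_mean:
        flags.append("low mean")
    return {"n": n, "pct_agree": pct_agree, "pct_unsure": pct_unsure,
            "pct_disagree": pct_disagree, "mean": mean, "flags": flags}

# Made-up responses for illustration only (not UEA data).
example = [4] * 10 + [3] * 35 + [2] * 50 + [1] * 5
summary = summarise_item(example)
print(summary)
```

In this made-up example the mean sits exactly at the 2.5 threshold and so is not flagged, yet the item is still picked up because fewer than half of the respondents agree and half are unsure, which is the kind of pattern the flags are intended to expose.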

Comparing Two Independent Samples
Considerations
The second objective of the DREEM is to compare two completely separate or independent groups of students. Till (2004) compares groups of males and females using the independent samples t test, whereas Miles and Leinster (2009) use the Wilcoxon Mann Whitney test to compare staff and student perceptions of the educational environment.
The independent samples t test is the classical parametric method of comparing two populations. The textbook view requires that the data come from a normal distribution, unless the sample size n is "large" (conventionally at least 30). Distributions that are severely non-normal, as can occur for DREEM data, will, in general, require bigger samples for the t test to be appropriate.
When the t test is not appropriate the corresponding non-parametric test, the Wilcoxon Mann Whitney (WMW) test, is often used. However, even this test requires some assumptions. In particular it requires that both samples come from probability distributions with a similar shape, but possibly a different "centre". This is unlikely with Likert response data, such as the DREEM with its five response options, because there are only a few possible values. Additionally, WMW is based on ranking (ordering) the data and as such ties in the ranks (i.e. equal values), which are quite likely when there are only a few possible values, can affect the outcome.
In the statistical literature there is a long-standing debate on whether the t test or WMW test should be used to compare two independent samples when the data are non-normal. A "good" test should deliver the significance level it is theoretically supposed to (usually 5%) and also have "good" power; that is, a high chance of spotting deviations from the null hypothesis, for instance, of spotting a real difference between two populations. Glass (1972) cites empirical evidence that, even if the distribution is quite skewed or has very fat tails (high kurtosis) and even for a five point Likert response, the t test has an actual significance level which is similar to the one calculated for normally distributed data, even for small samples. Also, he cites evidence that the power of a t test used on non-normal data might be slightly higher than the "normal" equivalent for mid-range powers like 0.1 to 0.7 and only slightly worse for larger powers closer to 1. He therefore advocates using parametric tests in most cases. Blair (1981) argues that the issue should not be whether the t test preserves the significance level and power calculated under the normality assumption, but whether there is another test which has greater power. Nonparametric tests are known to have slightly worse power than the t test when the data are normal but they can have much bigger power when the data are non-normal, in particular when the data are skewed. In particular, for large samples the WMW test never has worse power than the analogous t test performed on samples of 0.864 × the sample size but can, in some circumstances (usually a skewed distribution), have equivalent power to the t test on samples three times bigger. This evidence largely applies to continuous distributions and it is not clear to what extent it applies to Likert responses, in particular those commonly generated by the DREEM. Norman (2010) advocates the wider use of parametric tests for Likert responses and cites several studies (including some of those cited here) which show that parametric tests give accurate results for particular types of skewed or ordinal data. However, he does not consider the possibility that the power may be larger using the corresponding non-parametric test.

Simulation
To address this issue we simulated a pair of samples from two different Likert response distributions 10,000 times. We did a t and a WMW test on each pair of samples using a 5% significance level. The number of times a test (correctly) detected a difference divided by 10,000 gives an estimate of the actual or achieved power of each test. We also simulated 10,000 pairs of samples from a single Likert response distribution, i.e. no difference between distributions, and performed the same two tests. The proportion of pairs which (falsely) detected a difference gives an estimate of the achieved significance level of the tests. We repeated the process on several pairs of distributions chosen to reflect patterns found in actual DREEM data including varying degrees of skewness, bimodal and high percentage of Unsure responses (see Appendix, Table A).
The results of these simulations suggest that for the more symmetric distributions the power of the t test and WMW are similar. However, when one or both distributions are skewed the WMW can have substantially greater power than the t test for lower sample sizes and sometimes even for n = 130. For instance, when comparing two distributions of 20%/60%/10%/8%/2% (i.e. DREEM data where 20% of the students Strongly Agree, 60% Agree, 10% Unsure, 8% Disagree, and 2% Strongly Disagree) and 40%/40%/10%/8%/2% respectively for a sample size of 130 in each group, the t test had an estimated achieved power of 40% and the WMW 68% (simulation 3 of Table A).
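The simulation procedure can be illustrated with a short Python sketch. It is not the Stata code used for Table A, and several choices are assumptions made to keep it self-contained: both p-values come from large-sample normal approximations (a pooled-variance t statistic referred to the normal distribution, and the WMW statistic with a tie correction and no continuity correction), the default number of replications is smaller than 10,000, and the function names and seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

SCORES = [4, 3, 2, 1, 0]  # Strongly Agree ... Strongly Disagree
_norm = NormalDist()

def _p_two_sided(z):
    return 2 * (1 - _norm.cdf(abs(z)))

def _mean_var(v):
    m = sum(v) / len(v)
    return m, sum((x - m) ** 2 for x in v) / (len(v) - 1)

def t_test_p(x, y):
    """Pooled-variance t statistic; p-value from the normal approximation."""
    nx, ny = len(x), len(y)
    (mx, vx), (my, vy) = _mean_var(x), _mean_var(y)
    pooled = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    if pooled == 0:
        return 1.0
    return _p_two_sided((mx - my) / sqrt(pooled * (1 / nx + 1 / ny)))

def wmw_p(x, y):
    """WMW test: midranks for ties, tie-corrected normal approximation."""
    nx, ny = len(x), len(y)
    n = nx + ny
    pooled = sorted(x + y)
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u = sum(rank_of[v] for v in x) - nx * (nx + 1) / 2
    var_u = nx * ny / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    return _p_two_sided((u - nx * ny / 2) / sqrt(var_u))

def estimate_rejection_rates(probs_a, probs_b, n, reps=2000, alpha=0.05, seed=1):
    """Proportion of simulated sample pairs in which each test rejects."""
    rng = random.Random(seed)
    hits_t = hits_w = 0
    for _ in range(reps):
        x = rng.choices(SCORES, weights=probs_a, k=n)
        y = rng.choices(SCORES, weights=probs_b, k=n)
        hits_t += t_test_p(x, y) < alpha
        hits_w += wmw_p(x, y) < alpha
    return hits_t / reps, hits_w / reps

if __name__ == "__main__":
    # The two distributions quoted in the text for simulation 3 of Table A.
    a = [0.20, 0.60, 0.10, 0.08, 0.02]
    b = [0.40, 0.40, 0.10, 0.08, 0.02]
    print("power (t, WMW):", estimate_rejection_rates(a, b, n=130))
    print("level (t, WMW):", estimate_rejection_rates(a, a, n=130))
```

Passing two different distributions estimates power; passing the same distribution twice estimates the achieved significance level. Because of the approximations and the smaller number of replications, the estimates should only be expected to fall in the neighbourhood of the figures reported in Table A, not to match them exactly.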
We should emphasise (illustrated in the final simulation of Table A) that these tests cannot detect different distributions if the mean/medians are similar. We therefore suggest comparing the percentages of respondents who disagree (i.e. Disagree/Strongly Disagree) using a chi squared test. Note that chi squared tests comparing three or more categories between groups are not appropriate as the data are ordinal, not nominal. Power calculations using standard sample size software suggest that it is feasible to use a chi squared analysis on a whole year group of students (n = 130) but not on sub-groups within a year group. For instance (using nQuery), if in one year 50% of respondents Disagreed/Strongly Disagreed, a chi squared test to detect a 20 percentage point difference the following year would have a power of 91% for n = 130 but only 53% for n = 50.
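The power figures above can be approximated by hand with the usual large-sample formula for comparing two proportions. The sketch below is an assumption about what standard sample size software computes, not a description of nQuery itself; the function name `power_two_proportions` is hypothetical, equal group sizes are assumed and the negligible contribution of the opposite tail is ignored.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided test of p1 vs p2 with n per group.

    Uses the pooled standard error under the null hypothesis and the
    unpooled standard error under the alternative (normal approximation).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return norm.cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)

# 50% disagreeing in one year against 30% the next (a 20 point difference).
for n in (130, 50):
    print(n, round(power_two_proportions(0.5, 0.3, n), 3))
```

Under these assumptions the approximation lands close to the 91% and 53% quoted above, which makes the practical point easy to check for other percentages or group sizes.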

Multiple tests
If every DREEM item is analysed individually, 50 separate significance tests will be performed. If the significance level is 5% and the tests are independent, it can be shown mathematically that there is a 92% chance that at least one is significant when no real difference exists. A classical solution to this, known as Bonferroni's correction, is to divide the significance level by the number of tests. However, this is known to be conservative and it increases the probability of missing a real difference. Another school of thought advocates reducing the number of outcomes under study and interpreting the results of statistical tests in the context of the quality of the study and the size of the finding (e.g. Feise, 2002). For the DREEM this might mean including in the main analysis only those items identified previously as requiring remedial action.
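The 92% figure follows from a one-line calculation, assuming for simplicity that the 50 tests are independent and each is carried out at the 5% level. The symbols α (significance level per test), m (number of tests) and FWER (the probability of at least one false positive) are notation introduced here for illustration.

```latex
\[
\mathrm{FWER} \;=\; 1 - (1 - \alpha)^{m} \;=\; 1 - 0.95^{50} \;\approx\; 0.92,
\qquad
\alpha_{\text{Bonferroni}} \;=\; \frac{\alpha}{m} \;=\; \frac{0.05}{50} \;=\; 0.001 .
\]
```

The second expression shows why the Bonferroni correction is demanding for the DREEM: each item would have to reach significance at the 0.1% level.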
Recommendations
Table 2 demonstrates our recommendations, informed by the simulations, for comparing two independent samples of DREEM responses. It uses data from UEA Year 1 and Year 2 medical students on the DREEM's Academic self perceptions subscale. We suggest reporting the results of the DREEM in a table summarising the responses using the percentage Strongly Agree/Agree, Unsure, and Strongly Disagree/Disagree for each group, the two means, the mean difference and then the results of both a t test and a Wilcoxon Mann Whitney test. We would also include a chi squared test of the difference in the percentage who Strongly Disagree/Disagree (it would be equally valid to do a chi squared test of the difference in the percentage who Strongly Agree/Agree). A rule of thumb for the validity of the chi squared test is that np and n(1 − p), where p is the observed proportion over both groups, are both 5 or more. We therefore suggest exercising caution and not performing the test where an observed percentage is, say, less than 5%. Significance on any test would be flagged, without any adjustment for multiple comparisons. And, as in the diagnostic Table 1, low percentage agreement, high unsure, high disagreement and low means would also be flagged.
Notice that both the t and WMW tests are significant for the items "Much of what I have to learn seems relevant to a career in healthcare", "I am able to memorize all I need", "I am confident about passing this year" and "Learning strategies which worked for me before continue to work for me now"; but for the item, "Last year's work has been a good preparation for this year" the WMW is highly significant whereas the t test is not significant.On inspection this latter item is highly skewed which explains why WMW has detected a difference but the t test has not, as suggested by the simulations.

Comparing Two Matched Samples
Considerations
Matched samples arise when two sets of responses are obtained for the same group of individuals, for instance at two separate points in time; the scores of interest are the set of change scores. For DREEM, matched data also arise when student expectations of the environment are compared with their actual perceptions at the end of that year (e.g. Miles & Leinster, 2007). The amount by which the actual scores fall short of the expected is termed the "dissonance". Till (2005) reports items with the largest dissonance and uses the paired sample t test. Miles and Leinster (2007) report the average dissonance for each item of the DREEM and then use a Wilcoxon Signed Rank (WSR) test to test whether the subscales have zero median dissonance. The paired samples t test is equivalent to a single sample t test of the hypothesis that the changes have zero mean. It assumes that the changes are normally distributed but, as for the independent samples t test, this condition can be waived for "large" samples. The WSR is a non-parametric test, but still assumes that the distribution of the changes is symmetric. Glass (1972: p. 262) gives a table from Srivastava (1959) reporting the theoretical power of the t test if it is conducted on small samples of data (n = 10) with various types of non-normality. The power, unless it is low, is very similar to that of normal data, supporting the use of the t test.
Copyright © 2013 SciRes.

Simulation
To investigate the power of the two types of test we simulated 10,000 samples from each of four possible change distributions. These distributions were chosen to be typical of the distributions of the changes and dissonances found in actual DREEM data and to have non-zero means and varying degrees of symmetry/skewness (see Appendix, Table B). Again the proportion of simulated samples which detect a non-zero mean change gives an estimate of the power of each of the tests. The results suggest that the two types of test have similar power for more symmetric distributions but the WSR has slightly better power for skewed distributions unless the power approaches 100%. For instance, if 10% of the changes are −2, 30% are −1, 50% are 0, 5% are 1 and 5% are 2, i.e. a skewed distribution with effect size about 0.4, the power of the t test is 75% and of the WSR is 85% for a sample size of 50 (simulation 3 of Appendix, Table B).
The achieved significance level of these tests depends on the exact distribution of the changes under the null hypothesis of a zero mean/median. To estimate this we simulated 10,000 samples from several zero mean distributions with varying skewness (see Appendix, Table C). The results indicated that for the symmetric distributions both tests give achieved significance levels which are approximately 5% as desired. However, for skewed distributions the WSR test appears more likely to incorrectly detect a change than it should be. For instance, for a moderately skewed distribution (40% of the changes are 1, 30% zero, 20% −1 and 10% −2) 8.8% of samples of size 130 give a significant result when the WSR is used, but only 5.3% with the t test (simulation 3 of Appendix, Table C).
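The inflated rejection rate of the WSR for skewed, zero mean changes can be explored with a sketch along the following lines. It is not the Stata code behind Table C and rests on stated assumptions: both p-values use normal approximations, and zero changes are ranked together with the other absolute changes and then given sign 0 (Pratt-style handling). Dropping zeros before ranking is another common convention and can give noticeably different rejection rates, so the handling of zeros should be checked in whatever software is used. Function names and the seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()

def _p_two_sided(z):
    return 2 * (1 - _norm.cdf(abs(z)))

def paired_t_p(d):
    """One-sample t statistic on the changes; normal approximation p-value."""
    n = len(d)
    m = sum(d) / n
    var = sum((x - m) ** 2 for x in d) / (n - 1)
    if var == 0:
        return 1.0
    return _p_two_sided(m / sqrt(var / n))

def signed_rank_p(d):
    """Signed rank test with midranks; zeros ranked, then given sign 0."""
    n = len(d)
    abs_sorted = sorted(abs(x) for x in d)
    rank_of, i = {}, 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    # Sum of signed ranks and its null variance (zeros contribute nothing).
    stat = sum(rank_of[abs(x)] * (1 if x > 0 else -1) for x in d if x != 0)
    var = sum(rank_of[abs(x)] ** 2 for x in d if x != 0)
    if var == 0:
        return 1.0
    return _p_two_sided(stat / sqrt(var))

def estimate_levels(values, probs, n, reps=2000, alpha=0.05, seed=1):
    """Proportion of simulated samples of changes in which each test rejects."""
    rng = random.Random(seed)
    hits_t = hits_w = 0
    for _ in range(reps):
        d = rng.choices(values, weights=probs, k=n)
        hits_t += paired_t_p(d) < alpha
        hits_w += signed_rank_p(d) < alpha
    return hits_t / reps, hits_w / reps

# The moderately skewed zero mean distribution quoted in the text.
CHANGES = [1, 0, -1, -2]
PROBS = [0.4, 0.3, 0.2, 0.1]

if __name__ == "__main__":
    print("achieved level (t, WSR):", estimate_levels(CHANGES, PROBS, n=130))
```

The mean change in this distribution is exactly zero, so every rejection is a false positive. The sketch is meant only to show the direction of the effect reported in Table C, with the WSR rejecting more often than the t test, rather than to reproduce the tabulated values.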
Note that the chi squared test is not a valid test to compare percentages of matched data as the same students are contributing scores into both data sets. McNemar's test of equal proportions is appropriate (e.g. Agresti 2002, page 411).
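McNemar's test uses only the students whose answer category changed between the two occasions. A minimal sketch of the large-sample version, without continuity correction, is given below; the function name `mcnemar_p` and the counts in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mcnemar_p(b, c):
    """Two-sided p-value for McNemar's test (large-sample approximation).

    b = number who Disagreed/Strongly Disagreed on the first occasion only,
    c = number who did so on the second occasion only. Students who gave
    the same category both times do not enter the statistic.
    """
    if b + c == 0:
        return 1.0
    z = (b - c) / sqrt(b + c)   # z squared is the usual 1-df chi squared statistic
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 15 students moved out of disagreement, 5 moved into it.
print(round(mcnemar_p(15, 5), 4))
```

With few discordant pairs an exact binomial version of the test would be the safer choice; the approximation here is only intended to show which counts the test depends on.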
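Recommendations
These findings lead us to suggest producing a similar table to Table 2 (for comparing two independent samples) for matched data, but reporting only the t test and using McNemar's test instead of the chi squared test (example table not provided due to the similarity to Table 2).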

Subscales and Total Scores
Subscale scores of the DREEM are constructed by adding up responses from the seven to twelve individual items making up the subscale. As with the individual items, the developers give guidance on interpreting the score for each subscale and total (McAleer & Roff, 2001) but none on statistical inference. Statistically, whilst sums of independent items are likely to be "more" normally distributed than the items themselves, items which have been grouped into subscales are likely to be mutually correlated and so there may still be strong non-normality. We therefore advocate treating the subscale results in much the same way as the individual items; that is, performing both t and non-parametric tests in the independent samples case but only t tests on matched samples. However, as subscale scores can take a large number of possible values the median could be reported as well as the mean. For consistency of presentation we would recommend reporting total DREEM scores in a similar way.

Discussion and Conclusion
Methods for the analysis and reporting of the DREEM have not been consistent in the medical education research literature and more generally there has been controversy on how Likert response data should be analysed. The results of our simulations have led to these guidelines for the analysis and reporting of DREEM data. However, the results of our simulations are applicable to Likert responses in general and support the view that when comparing independent samples, in particular those from skewed or bimodal distributions, the non-parametric WMW test performs well and may have greater power than the t test. However, one should be wary of using non-parametric methods on matched samples as they may be overly ready to reject null hypotheses.
We have not explicitly considered the comparison of three or more independent samples (for example DREEM data from all years of a five-year medical course). The selection of three or more distributions for simulation under the alternative hypothesis is impractical as there is a plethora of possibilities; so we have not run simulations for such comparisons. However, our view would be to use the analogues of the independent samples t test and WMW test; that is, analysis of variance and its non-parametric equivalent, the Kruskal-Wallis test, in a similar way to the two sample situation.
The recommendations we have given will make it easier for those involved in evaluation to report and analyse the DREEM. This should allow medical schools to use the DREEM to more accurately identify areas for change and assess the success of consequent changes. Further, greater standardisation of method should facilitate comparison between medical schools. More generally, the simulation results add to the understanding of how to analyse individual Likert responses, a subject of some contention.

Appendices
Table A.
Estimated achieved significance level and power of independent two sample tests (p = 0.05) based on 10,000 simulations.

Table 1.
Example of a diagnostics table. Academic self perceptions subscale: a Year 1 cohort of UEA medical students. n = 147 unless otherwise specified. Flags: Less than 50% Agree/Strongly Agree; More than 30% Unsure; More than 20% Disagree/Strongly Disagree; Mean less than 2.5.

Table 2.
Example of a table for comparing two independent samples. Academic self perceptions subscale comparing two different cohorts of UEA medical students.

Column headings: DREEM Item; Year 1 (n, SA/A, Unsure, SD/D); Year 2 (n, SA/A, Unsure, SD/D); Chi sq (SD/D); Mean (Year 1); Mean (Year 2); T test; WMW.
SA/A = Strongly Agree/Agree; SD/D = Strongly Disagree/Disagree; Chi square test between percentage Strongly Disagree/Disagree where both percentages are >5% only; Flags: Less than 50% Agree/Strongly Agree; More than 30% Unsure; More than 20% Disagree/Strongly Disagree; Mean less than 2.5.

Table 2
(for comparing two independent samples) for matched data but reporting only the t test and using McNemar instead of the chi squared test (example table not provided due to the similarity to Table2).

Table B .
Estimated achieved power of matched two sample tests using 10,000 simulations (p = 0.05).ES = Effect size is mean divided by standard deviation.Skew = The skewness coefficient of the distribution.0 is symmetric; WSR = Wilcoxon Signed Rank.

Table C .
Estimated achieved significance level from 10,000 simulations when comparing matched samples.